Hi Vimal,

One thing to consider is that, if you do have a variable schema, you may be presenting Parquet with a Drill feature which Parquet cannot support. Parquet appears to require that the schema be known when creating the file. In Drill-speak, this means that the batch used to create a Parquet file will define the schema. If another batch comes along later with a different schema, there is no way to go back and revise the Parquet schema.
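You can see the same constraint outside of Drill. Here is a minimal sketch using pyarrow (an illustration only, not Drill's own writer; the file name and column shapes are made up): the schema is fixed when the file is opened, and a later batch whose schema differs is rejected.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Illustration only: the schema is fixed at the moment the
    # Parquet file is created.
    schema = pa.schema([("abc", pa.struct([("x", pa.string())]))])
    writer = pq.ParquetWriter("example.parquet", schema)

    # A batch that matches the declared schema is accepted.
    writer.write_table(pa.table({"abc": [{"x": "y"}]}, schema=schema))

    # A later batch where "abc" gained an element is rejected; there
    # is no way to go back and revise the file's schema.
    wider = pa.schema([("abc", pa.struct([("x", pa.string()),
                                          ("n", pa.int64())]))])
    try:
        writer.write_table(
            pa.table({"abc": [{"x": "y", "n": 1}]}, schema=wider))
    except ValueError as err:
        print("rejected:", err)

    writer.close()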
That is, there is an impedance mismatch: Drill (in a subset of its operators) allows the schema to vary from one batch of records to the next, while Parquet (like JDBC or ODBC) requires that the schema be known up-front. In your case, this shows up as that JSON object (a Drill MAP) with a varying set of elements. Drill provides no way to bridge this gap; the ability to have ill-defined schemas is seen as a "feature" of Drill, not a bug.

The best solution is to do an ETL step to normalize the data before running it through Drill, along the lines of the sketch below. That way, although Drill does allow the schema to change, it won't, in fact, change, and so the Parquet writer will be happy.
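Something like this sketch would do for the ETL step. It assumes newline-delimited JSON, made-up file paths, and that you can enumerate the keys that "abc" may carry:

    import json

    # All keys that "abc" may ever carry -- an assumption for this
    # sketch; in practice, derive the list from a scan of your data.
    ABC_KEYS = ["x", "y"]

    def normalize(record):
        # Give "abc" a stable, non-empty shape, using nulls (None)
        # for elements missing in this particular record.
        abc = record.get("abc") or {}
        record["abc"] = {key: abc.get(key) for key in ABC_KEYS}
        return record

    # Assumes one JSON object per line (newline-delimited JSON).
    with open("jsonInput/data.json") as src, \
         open("normalized/data.json", "w") as dst:
        for line in src:
            dst.write(json.dumps(normalize(json.loads(line))) + "\n")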
Thanks,

- Paul

On Sun, Sep 20, 2020 at 11:50 PM Vimal Jain <[email protected]> wrote:

> Thanks Paul for the quick response.
> So reading your response, it looks like this has something to do with
> Parquet rather than Drill? I will post this question in the Parquet
> community group as well to see if we can get an answer for this.
>
> Thanks and Regards,
> Vimal Jain
>
>
> On Fri, Sep 18, 2020 at 10:45 PM Paul Rogers <[email protected]> wrote:
>
> > Hi Vimal,
> >
> > You've stumbled across one of the more frustrating bits of Drill. Drill
> > is "schema-free", meaning that the only information which Drill has to
> > read your data is the data itself. In your case, the JSON reader can
> > infer that "abc" is a MAP (Drill's term; Hive would call it a STRUCT).
> > Each file is read in a different "fragment". One fragment says that
> > "abc" is an empty MAP, another says that it has some schema. These are
> > merged sometime later in the query.
> >
> > If you had had a null value instead, Drill wouldn't know that "abc" is
> > a map and would have guessed INT as the type. So, it is good that you
> > have an empty object; it avoids ambiguity.
> >
> > It sounds like the issue is in the Parquet writer: it has some
> > limitation on an empty group. Why is the group empty? Because, when
> > writing the first file with the empty group, the Parquet writer has no
> > way to predict that your "abc" field will eventually include a
> > non-empty group. In fact, when the non-empty group does appear, the
> > Parquet schema must change. Not sure what Parquet will do in that case:
> > you may end up with some files with one schema, and other files with
> > another schema.
> >
> > What you want, of course, is for Drill to combine your files to create
> > a single schema for Parquet, setting fields to null when they are
> > missing. Drill can't currently do that effectively because it involves
> > predicting the future, which Drill cannot do.
> >
> > Does anyone have more direct knowledge of how Parquet handles this case?
> >
> > Thanks,
> >
> > - Paul
> >
> > On Fri, Sep 18, 2020 at 4:10 AM Vimal Jain <[email protected]> wrote:
> >
> > > Hi,
> > > I am trying to convert my JSON data into Parquet format using a CTAS
> > > query like the one below:
> > >
> > > create table ds2.root.`parquetOutput` as select * from
> > > TABLE(ds1.root.`jsonInput/` (type => 'json'));
> > >
> > > But it fails with this error:
> > >
> > > Error: SYSTEM ERROR: InvalidSchemaException: Cannot write a schema
> > > with an empty group: optional group abc {}
> > >
> > > Fragment 0:0
> > >
> > > Please, refer to logs for more information.
> > >
> > > [Error Id: fa3c0390-0093-4c4a-9b32-098d5cc68c7e on
> > > ip-172-30-3-153.ec2.internal:31010] (state=,code=0)
> > >
> > > So can someone explain what the issue is here? Can't my JSONs have a
> > > key "abc" with an empty object "{}" as its value?
> > > It's empty in some JSON files in ds1, but in some there is a value.
> > > Any help to resolve this would be appreciated.
> > >
> > > Thanks and Regards,
> > > Vimal Jain
