Re: Reading Parquet files with array or list columns

rahul challapalli Fri, 30 Jun 2017 11:39:42 -0700

As I suggested in the comment on DRILL-5183, can you try using a view as
a workaround until the issue gets resolved?


On Fri, Jun 30, 2017 at 10:41 AM, David Kincaid <[email protected]>
wrote:

> As far as I was able to discern, it is not possible to actually use this
> column as an array in Drill at all. Drill simply does not read the Parquet
> file correctly. I filed a very similar defect in Jira back in January, and
> it has had no attention at all. So we are moving on to other tools. I
> understand Drill is free and no one developing it owes me anything. It's
> just not going to work for us without proper support for nested objects in
> Parquet format.
>
> Thanks for the reply though. It's much appreciated to have some
> acknowledgment that I raised a valid issue.
>
> - Dave
>
> On Fri, Jun 30, 2017 at 12:06 PM, François Méthot <[email protected]>
> wrote:
>
> > Hi,
> >
> > Have you tried:
> >    select column['list'][0]['element'] from ...
> >        should return "My first value".
> >
> > or try:
> >     select flatten(column['list'])['element'] from ...
> >
> > Hope it helps. In our data we have a column that looks like this:
> > [{"NAME:":"Aname", "DATA":"thedata"},{"NAME:":"Aname2",
> > "DATA":"thedata2"},.....]
> >
> > We ended up writing a custom function to do the lookup instead of using
> > the costly flatten technique.
> >
> > Francois
> >
> >
> >
> > On Sat, Jun 17, 2017 at 10:04 PM, David Kincaid <[email protected]>
> > wrote:
> >
> > > I'm having a problem querying Parquet files that were created from Spark
> > > and have columns that are array or list types. When I do a SELECT on these
> > > columns they show up like this:
> > >
> > > {"list": [{"element": "My first value"}, {"element": "My second value"}]}
> > >
> > > which Drill does not recognize as a REPEATED column and is not really
> > > workable to hack around like I did in DRILL-5183
> > > (https://issues.apache.org/jira/browse/DRILL-5183). I can get to one value
> > > using something like t.columnName.`list`.`element` but that's not really
> > > feasible to use in a query.
> > >
> > > The little I could find on this by Googling around led me to this document
> > > on the Parquet format GitHub page -
> > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md.
> > > This seems to say that Spark is writing these files correctly, but Drill is
> > > not interpreting them properly.
> > >
> > > Is there a workaround anyone can suggest to turn these columns into
> > > values that Drill understands as repeated values? This is a fairly urgent
> > > issue for us.
> > >
> > > Thanks,
> > >
> > > Dave
> > >
> >
>
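
For readers following the thread, the wrapper shape quoted in the original
message ({"list": [{"element": ...}, ...]}) can be illustrated outside Drill.
This is only a minimal Python sketch, assuming a row value has already been
loaded into a plain dict; unwrap_list_column is a hypothetical helper name,
not a Drill or Parquet API, and it does not address the Drill-side issue
tracked in DRILL-5183.

```python
from typing import Any, Dict, List


def unwrap_list_column(value: Dict[str, Any]) -> List[Any]:
    """Turn {"list": [{"element": x}, ...]} into [x, ...]."""
    # Hypothetical helper for illustration only; not part of Drill or Parquet.
    # A missing or null "list" key is treated as an empty array.
    entries = value.get("list") or []
    return [entry["element"] for entry in entries]


# Sample value copied from the shape quoted in the thread.
row = {"list": [{"element": "My first value"}, {"element": "My second value"}]}
values = unwrap_list_column(row)
print(values)
```

The flat list in values is the repeated-value form the original question was
asking Drill to expose directly.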
