Hmm... I also see no simple workaround for the second case. Can you file a JIRA for the CTAS case as well? Drill could have been running short on heap memory.
- Rahul

On Fri, Jun 30, 2017 at 11:46 AM, David Kincaid <[email protected]> wrote:

> The view only works for the first example in the JIRA I created. That was
> the workaround we have been using since January.
>
> Recently we've had a use case where we are running a Spark script to
> pre-join some data before we try to use it in Drill. That was the subject
> of the initial e-mail in this thread and the topic of the comment I made
> in the JIRA on 6/17. As far as I've been able to tell, there isn't a
> similar workaround for this case that will make the column appear as an
> array.
>
> Note: I tried to use Drill to do that pre-join of the Parquet data using
> CTAS, but it ran for about 4 hours and then crashed. The Spark script
> that does the same thing runs successfully in 14 minutes.
>
> - Dave
>
> On Fri, Jun 30, 2017 at 1:38 PM, rahul challapalli
> <[email protected]> wrote:
>
> > As I suggested in the comment on DRILL-5183, can you try using a view
> > as a workaround until the issue gets resolved?
> >
> > On Fri, Jun 30, 2017 at 10:41 AM, David Kincaid <[email protected]>
> > wrote:
> >
> > > As far as I was able to discern, it is not possible to use this
> > > column as an array in Drill at all; Drill simply does not read the
> > > Parquet correctly. I filed a very similar defect in JIRA back in
> > > January that has had no attention at all, so we are moving on to
> > > other tools. I understand Drill is free and no one developing it owes
> > > me anything. It's just not going to work for us without proper
> > > support for nested objects in Parquet format.
> > >
> > > Thanks for the reply, though. It's much appreciated to have some
> > > acknowledgment that I raised a valid issue.
> > >
> > > - Dave
> > >
> > > On Fri, Jun 30, 2017 at 12:06 PM, François Méthot
> > > <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Have you tried:
> > > >
> > > >   select column['list'][0]['element'] from ...
> > > >
> > > > It should return "My first value".
> > > >
> > > > Or try:
> > > >
> > > >   select flatten(column['list'])['element'] from ...
> > > >
> > > > Hope it helps. In our data we have a column that looks like this:
> > > >
> > > >   [{"NAME": "Aname", "DATA": "thedata"},
> > > >    {"NAME": "Aname2", "DATA": "thedata2"}, .....]
> > > >
> > > > We ended up writing a custom function to do lookups instead of
> > > > using the costly flatten technique.
> > > >
> > > > Francois
> > > >
> > > > On Sat, Jun 17, 2017 at 10:04 PM, David Kincaid
> > > > <[email protected]> wrote:
> > > >
> > > > > I'm having a problem querying Parquet files that were created
> > > > > from Spark and have columns that are array or list types. When I
> > > > > do a SELECT on these columns they show up like this:
> > > > >
> > > > >   {"list": [{"element": "My first value"},
> > > > >             {"element": "My second value"}]}
> > > > >
> > > > > which Drill does not recognize as a REPEATED column, and which is
> > > > > not really workable to hack around like I did in DRILL-5183
> > > > > (https://issues.apache.org/jira/browse/DRILL-5183). I can get to
> > > > > one value using something like t.columnName.`list`.`element`, but
> > > > > that's not really feasible to use in a query.
> > > > >
> > > > > The little I could find on this by Googling around led me to this
> > > > > document on the Parquet format GitHub page:
> > > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md.
> > > > > It seems to say that Spark is writing these files correctly, but
> > > > > Drill is not interpreting them properly.
> > > > >
> > > > > Is there a workaround that anyone can help me with to turn these
> > > > > columns into values that Drill understands as repeated values?
> > > > > This is a fairly urgent issue for us.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Dave
