I'm querying parquet data which contains large numbers of empty maps/arrays
that I'd like to eliminate from the query as early as possible. The events I'm
querying are very sparse in the data.
> SELECT par.check.detections.list AS detections FROM
> `/my.parquet/year=2016/month=5` par LIMIT 10;
10 rows selected (3.157 seconds)
+-----------------------------+
| {"detections":{"list":[]}} |
+-----------------------------+
I've tried to use the map directly but the query fails.
> SELECT par.check AS check
> FROM `/my.parquet/year=2016/month=5` par
> WHERE par.check <> '{"detections":{"list":[]}}' ;
Error in expression at index -1. Error: Missing function implementation:
[equal(MAP-REQUIRED, VARCHAR-REQUIRED)]. Full expression: --UNKNOWN
EXPRESSION--..
I get a similar result if I reference down to the empty array.
> SELECT par.check.detections.list AS detections
> FROM `/my.parquet/year=2016/month=5` par
> WHERE par.check.detections.list <> '[]' ;
Error in expression at index -1. Error: Missing function implementation:
[equal(MAP-REPEATED, VARCHAR-REQUIRED)]. Full expression: --UNKNOWN
EXPRESSION--..
I'm also unable to refer to an array element without using FLATTEN().
> SELECT par.check.detections.list[0]
> FROM `/my.parquet/year=2016/month=5` par;
Error: Unexpected RuntimeException: java.lang.NullPointerException
(state=,code=0)
When I use FLATTEN() it automatically eliminates empty arrays and query time
goes way up. It seems to be doing quite a bit of work on each empty array
before eliminating it.
> SELECT FLATTEN(par.check.detections.list) AS detections
> FROM `/my.parquet/year=2016/month=5` par
> LIMIT 10;
No rows selected (290.475 seconds)
I can see in the docs that map and array are internal only data types that
aren't exposed to the user
(https://drill.apache.org/docs/supported-data-types/#composite-types). Is this
on the roadmap somewhere to be added?
Thanks,
Joe