Just spotted a typo in the email below, the table structure is: id : STRING host_details : ARRAY<STRUCT<key : STRING, value : STRING>> metrics : ARRAY<STRUCT<key : STRING, value : STRING>>
I.e. value is a STRING, not s STRUCT. Thanks, Tom On Mon, 30 Apr 2018 at 09:13, Tom Davis <mailinglists...@gmail.com> wrote: > Hi all, > > > I have a table with a lot of logging / metrics data that looks like this: > > > id: STRING > > host_details: ARRAY<STRUCT<key: STRING, value: STRUCT>> > > metrics: ARRAY<STRUCT<key: STRING, value: STRUCT>> > > > I would like to be able to extract known keys and their values (from each > of the host_details and metrics arrays-of-struct). Yes, this format sucks > and could be a map, but technically the keys aren’t unique, although I am > happy to make that compromise at this point. In reality the table has more > fields in the struct. > > > In Spark or MR, I could extract all the fields I need and project them to > columns with only a single pass over the table. I’m struggling to achieve > the same in Hive, which may be either my SQL ability or that there isn’t a > UDF available to meet my needs. > > > The most efficient approach I have found so far, uses the collect() > Brickhouse UDF And two lateral views generated from the inline() UDTF: > > > SELECT > > id, > > collect(t.key, t..value) AS map_h_m > > FROM > > (SELECT id, hds.key AS key, hds.value AS value > > FROM metrics > > LATERAL VIEW inline(metrics.host_details) hds > > UNION ALL > > SELECT id, ms.key AS key, ms.value AS value > > FROM metrics > > LATERAL VIEW inline(metrics.metrics)) t > > GROUP BY t.id > > > I can create this as a VIEW and then extract columns from the map using > the standard map_h_m[‘error_count’] notation. > > > Using collect() / GROUP BY seems more efficient than doing a join by id (I > think it cuts out a shuffle), but still has problems: > > - it’s still scanning the metrics table multiple times > - It assumes that keys are unique across both host_details and > metrics. > > > Does anyone have any thoughts before I attempt to write my own UDF(s) to > assist? > > > Many thanks, > > Tom >