If you want it as columns then you Chris’ suggestion will likely work best. AS LDItem is now defined after Flatten and you can reference the ‘column’ using the Drill dot notation as Chris shows in his example query.
Would it work for you situation to represent the values as rows? If so maybe do something like below, you can add the household, minutes etc to the query for the additional columns. This can then be a subquery to be queried with simpler rows and columns. SELECT k.generate.`key` as `column`, k.generate.`value` as `value` from (select flatten(kvgen(flatten(t.LDItem))) as generate from `./kvgen.json` t) as k; +------------------+----------------+ | column | value | +------------------+----------------+ | upc | 0001300000124 | | item_flags | 0 | | units | 1 | | retail | 199 | | cost | 130 | | discount_flags | 0 | | discount_units | 0 | | discount | 0 | | return_units | 0 | | return_discount | 0 | | upc | 0003800084505 | | item_flags | 0 | | units | 1 | | retail | 199 | | cost | 121 | | discount_flags | 1 | | discount_units | 1 | | discount | 49 | | return_units | 0 | | return_discount | 0 | | upc | 0078533175751 | | item_flags | 0 | | units | 1 | | retail | 369 | | cost | 233 | | discount_flags | 0 | | discount_units | 0 | | discount | 0 | | return_units | 0 | | return_discount | 0 | | upc | 0003800084508 | | item_flags | 0 | | units | 1 | | retail | 199 | | cost | 120 | | discount_flags | 1 | | discount_units | 1 | | discount | 49 | | return_units | 0 | | return_discount | 0 | | upc | 0001480000373 | | item_flags | 0 | | units | 1 | | retail | 239 | | cost | 180 | | discount_flags | 0 | | discount_units | 0 | | discount | 0 | | return_units | 0 | | return_discount | 0 | | upc | 0004200015121 | | item_flags | 0 | | units | 1 | | retail | 299 | | cost | 185 | | discount_flags | 1 | | discount_units | 1 | | discount | 30 | | return_units | 0 | | return_discount | 0 | | upc | 0003582608932 | | item_flags | 0 | | units | 1 | | retail | 138 | | cost | 79 | | discount_flags | 0 | | discount_units | 0 | | discount | 0 | | return_units | 0 | | return_discount | 0 | | upc | 0003600043220 | | item_flags | 0 | | units | 1 | | retail | 249 | | cost | 168 | | discount_flags | 1 | | discount_units | 1 | | discount | 49 | | return_units | 0 | | return_discount | 0 | +------------------+————————+ Note that kvgen in Drill 1.1 had an issue with heterogeneous data types and I had to use the setting store.json.all_text_mode, which will require some data conversion later. —Andries > On Sep 30, 2015, at 11:23 AM, Christopher Matta <[email protected]> wrote: > > Have you tried using the flatten query as a sub-query? > > Like: > > SELECT x.`household`, x.`minutes`, x.`LDItem`.`upc`, > x.`LDItem`.`item_flags` from (SELECT household, minutes, > FLATTEN(T.lditem_array.LDItem) as `LDItem` FROM > `dfs`.`root`.`./S/Beddo/spark/pig-xpdb` T limit 8) as x; > > > > Chris Matta > [email protected] > 215-701-3146 > > On Wed, Sep 30, 2015 at 2:05 PM, Mike Beddo <[email protected]> > wrote: > >> I am querying a Parquet file containing records for nested data: >> >> SELECT household, minutes, lditem_array FROM >> `dfs`.`root`.`./S/Beddo/spark/pig-xpdb` limit 1; >> >> This returns the first record, which contains an array LDItem. For the >> first record, LDItem is an array of 8 {} structures), like so: >> >> 1 046927725027 1135 { >> "LDItem" : [ { >> "upc" : "0001300000124", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 199, >> "cost" : 130, >> "discount_flags" : 0, >> "discount_units" : 0, >> "discount" : 0, >> "return_units" : 0, >> "return_discount" : 0 >> }, { >> "upc" : "0003800084505", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 199, >> "cost" : 121, >> "discount_flags" : 1, >> "discount_units" : 1, >> "discount" : 49, >> "return_units" : 0, >> "return_discount" : 0 >> }, { >> "upc" : "0078533175751", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 369, >> "cost" : 233, >> "discount_flags" : 0, >> "discount_units" : 0, >> "discount" : 0, >> "return_units" : 0, >> "return_discount" : 0 >> }, { >> "upc" : "0003800084508", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 199, >> "cost" : 120, >> "discount_flags" : 1, >> "discount_units" : 1, >> "discount" : 49, >> "return_units" : 0, >> "return_discount" : 0 >> }, { >> "upc" : "0001480000373", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 239, >> "cost" : 180, >> "discount_flags" : 0, >> "discount_units" : 0, >> "discount" : 0, >> "return_units" : 0, >> "return_discount" : 0 >> }, { >> "upc" : "0004200015121", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 299, >> "cost" : 185, >> "discount_flags" : 1, >> "discount_units" : 1, >> "discount" : 30, >> "return_units" : 0, >> "return_discount" : 0 >> }, { >> "upc" : "0003582608932", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 138, >> "cost" : 79, >> "discount_flags" : 0, >> "discount_units" : 0, >> "discount" : 0, >> "return_units" : 0, >> "return_discount" : 0 >> }, { >> "upc" : "0003600043220", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 249, >> "cost" : 168, >> "discount_flags" : 1, >> "discount_units" : 1, >> "discount" : 49, >> "return_units" : 0, >> "return_discount" : 0 >> } ] >> } >> >> I would like to flatten this 1 record into 8 records, with columns for >> upc, item_flags, units, retail, cost, ... The furthest I got was with this >> query: >> >> SELECT household, minutes, FLATTEN(T.lditem_array.LDItem) FROM >> `dfs`.`root`.`./S/Beddo/spark/pig-xpdb` T limit 8; >> >> This flattened the first record, but did not yielded new columns upc, >> item_flags, units, ... Instead, it gave me one record for each {} element, >> which I've not been able to split further into columns: >> >> 1 046927725027 1135 { >> "upc" : "0001300000124", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 199, >> "cost" : 130, >> "discount_flags" : 0, >> "discount_units" : 0, >> "discount" : 0, >> "return_units" : 0, >> "return_discount" : 0 >> } >> 2 046927725027 1135 { >> "upc" : "0003800084505", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 199, >> "cost" : 121, >> "discount_flags" : 1, >> "discount_units" : 1, >> "discount" : 49, >> "return_units" : 0, >> "return_discount" : 0 >> } >> 3 046927725027 1135 { >> "upc" : "0078533175751", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 369, >> "cost" : 233, >> "discount_flags" : 0, >> "discount_units" : 0, >> "discount" : 0, >> "return_units" : 0, >> "return_discount" : 0 >> } >> 4 046927725027 1135 { >> "upc" : "0003800084508", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 199, >> "cost" : 120, >> "discount_flags" : 1, >> "discount_units" : 1, >> "discount" : 49, >> "return_units" : 0, >> "return_discount" : 0 >> } >> 5 046927725027 1135 { >> "upc" : "0001480000373", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 239, >> "cost" : 180, >> "discount_flags" : 0, >> "discount_units" : 0, >> "discount" : 0, >> "return_units" : 0, >> "return_discount" : 0 >> } >> 6 046927725027 1135 { >> "upc" : "0004200015121", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 299, >> "cost" : 185, >> "discount_flags" : 1, >> "discount_units" : 1, >> "discount" : 30, >> "return_units" : 0, >> "return_discount" : 0 >> } >> 7 046927725027 1135 { >> "upc" : "0003582608932", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 138, >> "cost" : 79, >> "discount_flags" : 0, >> "discount_units" : 0, >> "discount" : 0, >> "return_units" : 0, >> "return_discount" : 0 >> } >> 8 046927725027 1135 { >> "upc" : "0003600043220", >> "item_flags" : 0, >> "units" : 1, >> "retail" : 249, >> "cost" : 168, >> "discount_flags" : 1, >> "discount_units" : 1, >> "discount" : 49, >> "return_units" : 0, >> "return_discount" : 0 >> } >> >> How can I get a result set with 12 columns, i.e. split each {...} list >> into columns? This is where I am stuck. >> >> Thanks, >> >> Mike Beddo >> >>
