Re: FLATTEN doesn't quite give me what I need

Andries Engelbrecht Wed, 30 Sep 2015 11:46:19 -0700

If you want it as columns then you Chris’ suggestion will likely work best. AS 
LDItem is now defined after Flatten and you can reference the ‘column’ using 
the Drill dot notation as Chris shows in his example query.


Would it work for you situation to represent the values as rows? If so maybe do 
something like below, you can add the household, minutes etc to the query for 
the additional columns. This can then be a subquery to be queried with simpler 
rows and columns.

SELECT k.generate.`key` as `column`, k.generate.`value` as `value` from (select 
flatten(kvgen(flatten(t.LDItem))) as generate from `./kvgen.json` t) as k;

+------------------+----------------+
|      column      |     value      |
+------------------+----------------+
| upc              | 0001300000124  |
| item_flags       | 0              |
| units            | 1              |
| retail           | 199            |
| cost             | 130            |
| discount_flags   | 0              |
| discount_units   | 0              |
| discount         | 0              |
| return_units     | 0              |
| return_discount  | 0              |
| upc              | 0003800084505  |
| item_flags       | 0              |
| units            | 1              |
| retail           | 199            |
| cost             | 121            |
| discount_flags   | 1              |
| discount_units   | 1              |
| discount         | 49             |
| return_units     | 0              |
| return_discount  | 0              |
| upc              | 0078533175751  |
| item_flags       | 0              |
| units            | 1              |
| retail           | 369            |
| cost             | 233            |
| discount_flags   | 0              |
| discount_units   | 0              |
| discount         | 0              |
| return_units     | 0              |
| return_discount  | 0              |
| upc              | 0003800084508  |
| item_flags       | 0              |
| units            | 1              |
| retail           | 199            |
| cost             | 120            |
| discount_flags   | 1              |
| discount_units   | 1              |
| discount         | 49             |
| return_units     | 0              |
| return_discount  | 0              |
| upc              | 0001480000373  |
| item_flags       | 0              |
| units            | 1              |
| retail           | 239            |
| cost             | 180            |
| discount_flags   | 0              |
| discount_units   | 0              |
| discount         | 0              |
| return_units     | 0              |
| return_discount  | 0              |
| upc              | 0004200015121  |
| item_flags       | 0              |
| units            | 1              |
| retail           | 299            |
| cost             | 185            |
| discount_flags   | 1              |
| discount_units   | 1              |
| discount         | 30             |
| return_units     | 0              |
| return_discount  | 0              |
| upc              | 0003582608932  |
| item_flags       | 0              |
| units            | 1              |
| retail           | 138            |
| cost             | 79             |
| discount_flags   | 0              |
| discount_units   | 0              |
| discount         | 0              |
| return_units     | 0              |
| return_discount  | 0              |
| upc              | 0003600043220  |
| item_flags       | 0              |
| units            | 1              |
| retail           | 249            |
| cost             | 168            |
| discount_flags   | 1              |
| discount_units   | 1              |
| discount         | 49             |
| return_units     | 0              |
| return_discount  | 0              |
+------------------+————————+

Note that kvgen in Drill 1.1 had an issue with heterogeneous data types and I 
had to use the setting store.json.all_text_mode, which will require some data 
conversion later.

—Andries


> On Sep 30, 2015, at 11:23 AM, Christopher Matta <[email protected]> wrote:
> 
> Have you tried using the flatten query as a sub-query?
> 
> Like:
> 
> SELECT x.`household`, x.`minutes`, x.`LDItem`.`upc`,
> x.`LDItem`.`item_flags` from (SELECT household, minutes,
> FLATTEN(T.lditem_array.LDItem) as `LDItem` FROM
> `dfs`.`root`.`./S/Beddo/spark/pig-xpdb` T limit 8) as x;
> 
> 
> 
> Chris Matta
> [email protected]
> 215-701-3146
> 
> On Wed, Sep 30, 2015 at 2:05 PM, Mike Beddo <[email protected]>
> wrote:
> 
>> I am querying a Parquet file containing records for nested data:
>> 
>> SELECT household, minutes, lditem_array FROM
>> `dfs`.`root`.`./S/Beddo/spark/pig-xpdb` limit 1;
>> 
>> This returns the first record, which contains an array LDItem. For the
>> first record, LDItem is an array of 8 {} structures), like so:
>> 
>> 1       046927725027    1135    {
>>  "LDItem" : [ {
>>    "upc" : "0001300000124",
>>    "item_flags" : 0,
>>    "units" : 1,
>>    "retail" : 199,
>>    "cost" : 130,
>>    "discount_flags" : 0,
>>    "discount_units" : 0,
>>    "discount" : 0,
>>    "return_units" : 0,
>>    "return_discount" : 0
>>  }, {
>>    "upc" : "0003800084505",
>>    "item_flags" : 0,
>>    "units" : 1,
>>    "retail" : 199,
>>    "cost" : 121,
>>    "discount_flags" : 1,
>>    "discount_units" : 1,
>>    "discount" : 49,
>>    "return_units" : 0,
>>    "return_discount" : 0
>>  }, {
>>    "upc" : "0078533175751",
>>    "item_flags" : 0,
>>    "units" : 1,
>>    "retail" : 369,
>>    "cost" : 233,
>>    "discount_flags" : 0,
>>    "discount_units" : 0,
>>    "discount" : 0,
>>    "return_units" : 0,
>>    "return_discount" : 0
>>  }, {
>>    "upc" : "0003800084508",
>>    "item_flags" : 0,
>>    "units" : 1,
>>    "retail" : 199,
>>    "cost" : 120,
>>    "discount_flags" : 1,
>>    "discount_units" : 1,
>>    "discount" : 49,
>>    "return_units" : 0,
>>    "return_discount" : 0
>>  }, {
>>    "upc" : "0001480000373",
>>    "item_flags" : 0,
>>    "units" : 1,
>>    "retail" : 239,
>>    "cost" : 180,
>>    "discount_flags" : 0,
>>    "discount_units" : 0,
>>    "discount" : 0,
>>    "return_units" : 0,
>>    "return_discount" : 0
>>  }, {
>>    "upc" : "0004200015121",
>>    "item_flags" : 0,
>>    "units" : 1,
>>    "retail" : 299,
>>    "cost" : 185,
>>    "discount_flags" : 1,
>>    "discount_units" : 1,
>>    "discount" : 30,
>>    "return_units" : 0,
>>    "return_discount" : 0
>>  }, {
>>    "upc" : "0003582608932",
>>    "item_flags" : 0,
>>    "units" : 1,
>>    "retail" : 138,
>>    "cost" : 79,
>>    "discount_flags" : 0,
>>    "discount_units" : 0,
>>    "discount" : 0,
>>    "return_units" : 0,
>>    "return_discount" : 0
>>  }, {
>>    "upc" : "0003600043220",
>>    "item_flags" : 0,
>>    "units" : 1,
>>    "retail" : 249,
>>    "cost" : 168,
>>    "discount_flags" : 1,
>>    "discount_units" : 1,
>>    "discount" : 49,
>>    "return_units" : 0,
>>    "return_discount" : 0
>>  } ]
>> }
>> 
>> I would like to flatten this 1 record into 8 records, with columns for
>> upc, item_flags, units, retail, cost, ... The furthest I got was with this
>> query:
>> 
>> SELECT household, minutes, FLATTEN(T.lditem_array.LDItem) FROM
>> `dfs`.`root`.`./S/Beddo/spark/pig-xpdb` T limit 8;
>> 
>> This flattened the first record, but did not yielded new columns upc,
>> item_flags, units, ... Instead, it gave me one record for each {} element,
>> which I've not been able to split further into columns:
>> 
>> 1       046927725027    1135    {
>>  "upc" : "0001300000124",
>>  "item_flags" : 0,
>>  "units" : 1,
>>  "retail" : 199,
>>  "cost" : 130,
>>  "discount_flags" : 0,
>>  "discount_units" : 0,
>>  "discount" : 0,
>>  "return_units" : 0,
>>  "return_discount" : 0
>> }
>> 2       046927725027    1135    {
>>  "upc" : "0003800084505",
>>  "item_flags" : 0,
>>  "units" : 1,
>>  "retail" : 199,
>>  "cost" : 121,
>>  "discount_flags" : 1,
>>  "discount_units" : 1,
>>  "discount" : 49,
>>  "return_units" : 0,
>>  "return_discount" : 0
>> }
>> 3       046927725027    1135    {
>>  "upc" : "0078533175751",
>>  "item_flags" : 0,
>>  "units" : 1,
>>  "retail" : 369,
>>  "cost" : 233,
>>  "discount_flags" : 0,
>>  "discount_units" : 0,
>>  "discount" : 0,
>>  "return_units" : 0,
>>  "return_discount" : 0
>> }
>> 4       046927725027    1135    {
>>  "upc" : "0003800084508",
>>  "item_flags" : 0,
>>  "units" : 1,
>>  "retail" : 199,
>>  "cost" : 120,
>>  "discount_flags" : 1,
>>  "discount_units" : 1,
>>  "discount" : 49,
>>  "return_units" : 0,
>>  "return_discount" : 0
>> }
>> 5       046927725027    1135    {
>>  "upc" : "0001480000373",
>>  "item_flags" : 0,
>>  "units" : 1,
>>  "retail" : 239,
>>  "cost" : 180,
>>  "discount_flags" : 0,
>>  "discount_units" : 0,
>>  "discount" : 0,
>>  "return_units" : 0,
>>  "return_discount" : 0
>> }
>> 6       046927725027    1135    {
>>  "upc" : "0004200015121",
>>  "item_flags" : 0,
>>  "units" : 1,
>>  "retail" : 299,
>>  "cost" : 185,
>>  "discount_flags" : 1,
>>  "discount_units" : 1,
>>  "discount" : 30,
>>  "return_units" : 0,
>>  "return_discount" : 0
>> }
>> 7       046927725027    1135    {
>>  "upc" : "0003582608932",
>>  "item_flags" : 0,
>>  "units" : 1,
>>  "retail" : 138,
>>  "cost" : 79,
>>  "discount_flags" : 0,
>>  "discount_units" : 0,
>>  "discount" : 0,
>>  "return_units" : 0,
>>  "return_discount" : 0
>> }
>> 8       046927725027    1135    {
>>  "upc" : "0003600043220",
>>  "item_flags" : 0,
>>  "units" : 1,
>>  "retail" : 249,
>>  "cost" : 168,
>>  "discount_flags" : 1,
>>  "discount_units" : 1,
>>  "discount" : 49,
>>  "return_units" : 0,
>>  "return_discount" : 0
>> }
>> 
>> How can I get a result set with 12 columns, i.e. split each {...} list
>> into columns? This is where I am stuck.
>> 
>> Thanks,
>> 
>> Mike Beddo
>> 
>>

Re: FLATTEN doesn't quite give me what I need

Reply via email to