Re: Querying k/v pairs in a CSV column

Christopher Matta Mon, 23 Nov 2015 19:17:32 -0800

An illustration of my previous email:

Original csv with a varN=value column:


0: jdbc:drill:> select * from `kv.csv`;
+-----------------------------------------------------------+
|                          columns                          |
+-----------------------------------------------------------+
| ["abc123","chris matta","var1=test var2=test var3=test"]  |
+-----------------------------------------------------------+
1 row selected (0.554 seconds)

Convert it to valid JSON with a creative set of regular expressions:

0: jdbc:drill:> select `columns`[0], CONVERT_FROM('{"' ||
regexp_replace(regexp_replace(`columns`[2], '\s', '","'), '=', '": "')
|| '"}', 'JSON') from `kv.csv`;
+---------+----------------------------------------------+
| EXPR$0  |                    EXPR$1                    |
+---------+----------------------------------------------+
| abc123  | {"var1":"test","var2":"test","var3":"test"}  |
+---------+----------------------------------------------+

Then you can use flatten with kvgen:

0: jdbc:drill:> select t.`id`, flatten(kvgen(t.`values`)) from (select
`columns`[0] `id`, CONVERT_FROM('{"' ||
regexp_replace(regexp_replace(`columns`[2], '\s', '","'), '=', '": "')
|| '"}', 'JSON') `values` from `kv.csv`) t;
+---------+--------------------------------+
|   id    |             EXPR$1             |
+---------+--------------------------------+
| abc123  | {"key":"var1","value":"test"}  |
| abc123  | {"key":"var2","value":"test"}  |
| abc123  | {"key":"var3","value":"test"}  |
+---------+--------------------------------+
3 rows selected (0.597 seconds)



Chris Matta
[email protected]
215-701-3146

On Mon, Nov 23, 2015 at 10:09 PM, Christopher Matta <[email protected]> wrote:

> From a suggestion in a separate thread I did something like this to
> convert my delimited substring into a JSON array, this allowed me to use
> FLATTEN:
>
> 0: jdbc:drill:> select `columns`[0] `id`, `columns`[16] `classes` FROM 
> `maprfs.cmatta`.`product.tsv` limit 10;
> +-------------------------------------------------+------------------------------------------------------------------------------------------------+
> |                       id                        |                           
>                  classes                                             |
> +-------------------------------------------------+------------------------------------------------------------------------------------------------+
> | 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Radioactive Diagnostic 
> Agent [EPC],Positron Emitting Activity [MoA]                            |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Antiarrhythmic 
> [EPC],Cytochrome P450 2D6 Inhibitor [EPC],Cytochrome P450 2D6 Inhibitors 
> [MoA]  |
> | 0002-1433_4137e5fc-0d0b-4171-842d-db80973d1a36  |                           
>                                                                      |
> | 0002-1434_4137e5fc-0d0b-4171-842d-db80973d1a36  |                           
>                                                                      |
> | 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen [EPC],Androgen 
> Receptor Agonists [MoA],Androstanes [Chemical/Ingredient]              |
> | 0002-3004_77661866-2f7c-4b90-8866-dce57ebb87bb  | Serotonin Reuptake 
> Inhibitor [EPC],Serotonin Uptake Inhibitors [MoA]                           |
> | 0002-3227_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine Reuptake 
> Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]                 |
> | 0002-3228_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine Reuptake 
> Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]                 |
> | 0002-3229_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine Reuptake 
> Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]                 |
> +-------------------------------------------------+------------------------------------------------------------------------------------------------+
> 10 rows selected (0.545 seconds)
> 0: jdbc:drill:> select `columns`[0] `id`, FLATTEN(CONVERT_FROM('["' || 
> REGEXP_REPLACE(`columns`[16],',','","') || '"]','JSON')) `classes` FROM 
> `maprfs.cmatta`.`product.tsv` limit 10;
> +-------------------------------------------------+---------------------------------------+
> |                       id                        |                classes    
>             |
> +-------------------------------------------------+---------------------------------------+
> | 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Radioactive Diagnostic 
> Agent [EPC]    |
> | 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Positron Emitting 
> Activity [MoA]      |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Antiarrhythmic [EPC]      
>             |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Cytochrome P450 2D6 
> Inhibitor [EPC]   |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Cytochrome P450 2D6 
> Inhibitors [MoA]  |
> | 0002-1433_4137e5fc-0d0b-4171-842d-db80973d1a36  |                           
>             |
> | 0002-1434_4137e5fc-0d0b-4171-842d-db80973d1a36  |                           
>             |
> | 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen [EPC]            
>             |
> | 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen Receptor 
> Agonists [MoA]      |
> +-------------------------------------------------+---------------------------------------+
> 10 rows selected (0.876 seconds)
> 0: jdbc:drill:>
>
> If you could come up with a regexp and string formatting to turn your 
> var1=test
> var2=test var3=test column into valid JSON you could use CONVERT_FROM as
> Tomer suggested.
> 
>
> Chris Matta
> [email protected]
> 215-701-3146
>
> On Sat, Nov 21, 2015 at 6:38 AM, Christopher Matta <[email protected]>
> wrote:
>
>> Along the same lines I have a data set that's delimited by pipe with the
>> last column further delimited by commas. It would be great to be able to
>> flatten that last column.
>>
>>
>> On Wednesday, October 7, 2015, Tugdual Grall <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Can you clarify how to use our own format in CONVERT_FROM?
>>>
>>> 1- is it supported out of the box ? (cannot find that in the doc -
>>>
>>> https://drill.apache.org/docs/supported-data-types/#data-types-for-convert_to-and-convert_from-functions
>>> - may be not looking at the proper place.
>>>
>>> 2- it is another extension point? so developer can create their own
>>> parser
>>> and deploy it?
>>>
>>>
>>>
>>> Thanks
>>>
>>> On Wed, Oct 7, 2015 at 4:55 PM, Tomer Shiran <[email protected]> wrote:
>>>
>>> > I think it would be much better to leverage something like
>>> > CONVERT_FROM(col, 'MyKeyValueFormat') where the entire key/value cell
>>> is
>>> > parsed into Drill's internal structure just like we do with JSON. That
>>> > exposes all those key/value pairs such that you can leverage them in
>>> other
>>> > SQL operators.
>>> >
>>> > On Wed, Oct 7, 2015 at 7:32 AM, Tugdual Grall <[email protected]>
>>> wrote:
>>> >
>>> >> You can create a custom function to do that:
>>> >> - https://drill.apache.org/docs/develop-custom-functions/
>>> >>
>>> >> this will end with something like:
>>> >>  select *
>>> >>  from *.csv
>>> >>  where GET_VALUE(col[x] , 'var1' ) = 'value'
>>> >>
>>> >> (I will publish, in the next few day, a similar function parsing URL
>>> in
>>> >> values in this repo :
>>> >> https://github.com/mapr-demos/simple-drill-functions )
>>> >>
>>> >> t
>>> >>
>>> >> On Wed, Oct 7, 2015 at 4:22 PM, <[email protected]> wrote:
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > If I have a separated file (PSV/TSV etc) with a number of columns,
>>> each
>>> >> > with a single value aside from the last column which contains a
>>> bunch of
>>> >> > key value pairs in a string e.g. var1=test var2=test var3=test. Is
>>> it
>>> >> > possible to query based on individual key/value items in the last
>>> >> column? I
>>> >> > am looking to querying nested data without using LIKE wildcards in
>>> where
>>> >> > statements (I am not sure this is possible with separated text
>>> data).
>>> >> >
>>> >> > Does anyone have any advice on whether this is possible?
>>> >> >
>>> >> > Thanks,
>>> >> > Mike
>>> >> >
>>> >> >
>>> >> > This e-mail (including any attachments) is private and
>>> confidential, may
>>> >> > contain proprietary or privileged information and is intended for
>>> the
>>> >> named
>>> >> > recipient(s) only. Unintended recipients are strictly prohibited
>>> from
>>> >> > taking action on the basis of information in this e-mail and must
>>> >> contact
>>> >> > the sender immediately, delete this e-mail (and all attachments) and
>>> >> > destroy any hard copies. Nomura will not accept responsibility or
>>> >> liability
>>> >> > for the accuracy or completeness of, or the presence of any virus or
>>> >> > disabling code in, this e-mail. If verification is sought please
>>> >> request a
>>> >> > hard copy. Any reference to the terms of executed transactions
>>> should be
>>> >> > treated as preliminary only and subject to formal written
>>> confirmation
>>> >> by
>>> >> > Nomura. Nomura reserves the right to retain, monitor and intercept
>>> >> e-mail
>>> >> > communications through its networks (subject to and in accordance
>>> with
>>> >> > applicable laws). No confidentiality or privilege is waived or lost
>>> by
>>> >> > Nomura by any mistransmission of this e-mail. Any reference to
>>> "Nomura"
>>> >> is
>>> >> > a reference to any entity in the Nomura Holdings, Inc. group. Please
>>> >> read
>>> >> > our Electronic Communications Legal Notice which forms part of this
>>> >> e-mail:
>>> >> > http://www.Nomura.com/email_disclaimer.htm
>>> >> >
>>> >> >
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Tomer Shiran
>>> > CEO and Co-Founder, Dremio
>>> >
>>>
>>
>>
>> --
>> Chris Matta
>> [email protected]
>> 215-701-3146
>>
>>
>

Re: Querying k/v pairs in a CSV column

Reply via email to