An illustration of my previous email:
Original csv with a varN=value column:
0: jdbc:drill:> select * from `kv.csv`;
+-----------------------------------------------------------+
| columns |
+-----------------------------------------------------------+
| ["abc123","chris matta","var1=test var2=test var3=test"] |
+-----------------------------------------------------------+
1 row selected (0.554 seconds)
Convert it to valid JSON with a creative set of regular expressions:
0: jdbc:drill:> select `columns`[0], CONVERT_FROM('{"' ||
regexp_replace(regexp_replace(`columns`[2], '\s', '","'), '=', '": "')
|| '"}', 'JSON') from `kv.csv`;
+---------+----------------------------------------------+
| EXPR$0 | EXPR$1 |
+---------+----------------------------------------------+
| abc123 | {"var1":"test","var2":"test","var3":"test"} |
+---------+----------------------------------------------+
Then you can use flatten with kvgen:
0: jdbc:drill:> select t.`id`, flatten(kvgen(t.`values`)) from (select
`columns`[0] `id`, CONVERT_FROM('{"' ||
regexp_replace(regexp_replace(`columns`[2], '\s', '","'), '=', '": "')
|| '"}', 'JSON') `values` from `kv.csv`) t;
+---------+--------------------------------+
| id | EXPR$1 |
+---------+--------------------------------+
| abc123 | {"key":"var1","value":"test"} |
| abc123 | {"key":"var2","value":"test"} |
| abc123 | {"key":"var3","value":"test"} |
+---------+--------------------------------+
3 rows selected (0.597 seconds)
Chris Matta
[email protected]
215-701-3146
On Mon, Nov 23, 2015 at 10:09 PM, Christopher Matta <[email protected]> wrote:
> From a suggestion in a separate thread I did something like this to
> convert my delimited substring into a JSON array, this allowed me to use
> FLATTEN:
>
> 0: jdbc:drill:> select `columns`[0] `id`, `columns`[16] `classes` FROM
> `maprfs.cmatta`.`product.tsv` limit 10;
> +-------------------------------------------------+------------------------------------------------------------------------------------------------+
> | id |
> classes |
> +-------------------------------------------------+------------------------------------------------------------------------------------------------+
> | 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc | Radioactive Diagnostic
> Agent [EPC],Positron Emitting Activity [MoA] |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589 | Antiarrhythmic
> [EPC],Cytochrome P450 2D6 Inhibitor [EPC],Cytochrome P450 2D6 Inhibitors
> [MoA] |
> | 0002-1433_4137e5fc-0d0b-4171-842d-db80973d1a36 |
> |
> | 0002-1434_4137e5fc-0d0b-4171-842d-db80973d1a36 |
> |
> | 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089 | Androgen [EPC],Androgen
> Receptor Agonists [MoA],Androstanes [Chemical/Ingredient] |
> | 0002-3004_77661866-2f7c-4b90-8866-dce57ebb87bb | Serotonin Reuptake
> Inhibitor [EPC],Serotonin Uptake Inhibitors [MoA] |
> | 0002-3227_7ad983d7-6ac4-4560-a5a7-0c9151cd3831 | Norepinephrine Reuptake
> Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA] |
> | 0002-3228_7ad983d7-6ac4-4560-a5a7-0c9151cd3831 | Norepinephrine Reuptake
> Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA] |
> | 0002-3229_7ad983d7-6ac4-4560-a5a7-0c9151cd3831 | Norepinephrine Reuptake
> Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA] |
> +-------------------------------------------------+------------------------------------------------------------------------------------------------+
> 10 rows selected (0.545 seconds)
> 0: jdbc:drill:> select `columns`[0] `id`, FLATTEN(CONVERT_FROM('["' ||
> REGEXP_REPLACE(`columns`[16],',','","') || '"]','JSON')) `classes` FROM
> `maprfs.cmatta`.`product.tsv` limit 10;
> +-------------------------------------------------+---------------------------------------+
> | id | classes
> |
> +-------------------------------------------------+---------------------------------------+
> | 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc | Radioactive Diagnostic
> Agent [EPC] |
> | 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc | Positron Emitting
> Activity [MoA] |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589 | Antiarrhythmic [EPC]
> |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589 | Cytochrome P450 2D6
> Inhibitor [EPC] |
> | 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589 | Cytochrome P450 2D6
> Inhibitors [MoA] |
> | 0002-1433_4137e5fc-0d0b-4171-842d-db80973d1a36 |
> |
> | 0002-1434_4137e5fc-0d0b-4171-842d-db80973d1a36 |
> |
> | 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089 | Androgen [EPC]
> |
> | 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089 | Androgen Receptor
> Agonists [MoA] |
> +-------------------------------------------------+---------------------------------------+
> 10 rows selected (0.876 seconds)
> 0: jdbc:drill:>
>
> If you could come up with a regexp and string formatting to turn your
> var1=test
> var2=test var3=test column into valid JSON you could use CONVERT_FROM as
> Tomer suggested.
>
>
> Chris Matta
> [email protected]
> 215-701-3146
>
> On Sat, Nov 21, 2015 at 6:38 AM, Christopher Matta <[email protected]>
> wrote:
>
>> Along the same lines I have a data set that's delimited by pipe with the
>> last column further delimited by commas. It would be great to be able to
>> flatten that last column.
>>
>>
>> On Wednesday, October 7, 2015, Tugdual Grall <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Can you clarify how to use our own format in CONVERT_FROM?
>>>
>>> 1- is it supported out of the box ? (cannot find that in the doc -
>>>
>>> https://drill.apache.org/docs/supported-data-types/#data-types-for-convert_to-and-convert_from-functions
>>> - may be not looking at the proper place.
>>>
>>> 2- it is another extension point? so developer can create their own
>>> parser
>>> and deploy it?
>>>
>>>
>>>
>>> Thanks
>>>
>>> On Wed, Oct 7, 2015 at 4:55 PM, Tomer Shiran <[email protected]> wrote:
>>>
>>> > I think it would be much better to leverage something like
>>> > CONVERT_FROM(col, 'MyKeyValueFormat') where the entire key/value cell
>>> is
>>> > parsed into Drill's internal structure just like we do with JSON. That
>>> > exposes all those key/value pairs such that you can leverage them in
>>> other
>>> > SQL operators.
>>> >
>>> > On Wed, Oct 7, 2015 at 7:32 AM, Tugdual Grall <[email protected]>
>>> wrote:
>>> >
>>> >> You can create a custom function to do that:
>>> >> - https://drill.apache.org/docs/develop-custom-functions/
>>> >>
>>> >> this will end with something like:
>>> >> select *
>>> >> from *.csv
>>> >> where GET_VALUE(col[x] , 'var1' ) = 'value'
>>> >>
>>> >> (I will publish, in the next few day, a similar function parsing URL
>>> in
>>> >> values in this repo :
>>> >> https://github.com/mapr-demos/simple-drill-functions )
>>> >>
>>> >> t
>>> >>
>>> >> On Wed, Oct 7, 2015 at 4:22 PM, <[email protected]> wrote:
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > If I have a separated file (PSV/TSV etc) with a number of columns,
>>> each
>>> >> > with a single value aside from the last column which contains a
>>> bunch of
>>> >> > key value pairs in a string e.g. var1=test var2=test var3=test. Is
>>> it
>>> >> > possible to query based on individual key/value items in the last
>>> >> column? I
>>> >> > am looking to querying nested data without using LIKE wildcards in
>>> where
>>> >> > statements (I am not sure this is possible with separated text
>>> data).
>>> >> >
>>> >> > Does anyone have any advice on whether this is possible?
>>> >> >
>>> >> > Thanks,
>>> >> > Mike
>>> >> >
>>> >> >
>>> >> > This e-mail (including any attachments) is private and
>>> confidential, may
>>> >> > contain proprietary or privileged information and is intended for
>>> the
>>> >> named
>>> >> > recipient(s) only. Unintended recipients are strictly prohibited
>>> from
>>> >> > taking action on the basis of information in this e-mail and must
>>> >> contact
>>> >> > the sender immediately, delete this e-mail (and all attachments) and
>>> >> > destroy any hard copies. Nomura will not accept responsibility or
>>> >> liability
>>> >> > for the accuracy or completeness of, or the presence of any virus or
>>> >> > disabling code in, this e-mail. If verification is sought please
>>> >> request a
>>> >> > hard copy. Any reference to the terms of executed transactions
>>> should be
>>> >> > treated as preliminary only and subject to formal written
>>> confirmation
>>> >> by
>>> >> > Nomura. Nomura reserves the right to retain, monitor and intercept
>>> >> e-mail
>>> >> > communications through its networks (subject to and in accordance
>>> with
>>> >> > applicable laws). No confidentiality or privilege is waived or lost
>>> by
>>> >> > Nomura by any mistransmission of this e-mail. Any reference to
>>> "Nomura"
>>> >> is
>>> >> > a reference to any entity in the Nomura Holdings, Inc. group. Please
>>> >> read
>>> >> > our Electronic Communications Legal Notice which forms part of this
>>> >> e-mail:
>>> >> > http://www.Nomura.com/email_disclaimer.htm
>>> >> >
>>> >> >
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Tomer Shiran
>>> > CEO and Co-Founder, Dremio
>>> >
>>>
>>
>>
>> --
>> Chris Matta
>> [email protected]
>> 215-701-3146
>>
>>
>