Re: Querying k/v pairs in a CSV column

Christopher Matta Mon, 23 Nov 2015 19:10:24 -0800

>From a suggestion in a separate thread I did something like this to convert
my delimited substring into a JSON array, this allowed me to use FLATTEN:


0: jdbc:drill:> select `columns`[0] `id`, `columns`[16] `classes` FROM
`maprfs.cmatta`.`product.tsv` limit 10;
+-------------------------------------------------+------------------------------------------------------------------------------------------------+
|                       id                        |
                        classes
     |
+-------------------------------------------------+------------------------------------------------------------------------------------------------+
| 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Radioactive
Diagnostic Agent [EPC],Positron Emitting Activity [MoA]
            |
| 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Antiarrhythmic
[EPC],Cytochrome P450 2D6 Inhibitor [EPC],Cytochrome P450 2D6
Inhibitors [MoA]  |
| 0002-1433_4137e5fc-0d0b-4171-842d-db80973d1a36  |

     |
| 0002-1434_4137e5fc-0d0b-4171-842d-db80973d1a36  |

     |
| 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen
[EPC],Androgen Receptor Agonists [MoA],Androstanes
[Chemical/Ingredient]              |
| 0002-3004_77661866-2f7c-4b90-8866-dce57ebb87bb  | Serotonin Reuptake
Inhibitor [EPC],Serotonin Uptake Inhibitors [MoA]
     |
| 0002-3227_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine
Reuptake Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]
         |
| 0002-3228_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine
Reuptake Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]
         |
| 0002-3229_7ad983d7-6ac4-4560-a5a7-0c9151cd3831  | Norepinephrine
Reuptake Inhibitor [EPC],Norepinephrine Uptake Inhibitors [MoA]
         |
+-------------------------------------------------+------------------------------------------------------------------------------------------------+
10 rows selected (0.545 seconds)
0: jdbc:drill:> select `columns`[0] `id`, FLATTEN(CONVERT_FROM('["' ||
REGEXP_REPLACE(`columns`[16],',','","') || '"]','JSON')) `classes`
FROM `maprfs.cmatta`.`product.tsv` limit 10;
+-------------------------------------------------+---------------------------------------+
|                       id                        |
classes                |
+-------------------------------------------------+---------------------------------------+
| 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Radioactive
Diagnostic Agent [EPC]    |
| 0002-1200_88439a12-a98f-4551-a98f-85422ab2fbcc  | Positron Emitting
Activity [MoA]      |
| 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Antiarrhythmic
[EPC]                  |
| 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Cytochrome P450
2D6 Inhibitor [EPC]   |
| 0002-1407_e7af3676-cd9d-4c30-b127-9f6c46ff1589  | Cytochrome P450
2D6 Inhibitors [MoA]  |
| 0002-1433_4137e5fc-0d0b-4171-842d-db80973d1a36  |
                   |
| 0002-1434_4137e5fc-0d0b-4171-842d-db80973d1a36  |
                   |
| 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen [EPC]
                   |
| 0002-1975_6bf7a335-bf6b-4f7f-b8cf-c7b6ee1ba089  | Androgen Receptor
Agonists [MoA]      |
+-------------------------------------------------+---------------------------------------+
10 rows selected (0.876 seconds)
0: jdbc:drill:>

If you could come up with a regexp and string formatting to turn your var1=test
var2=test var3=test column into valid JSON you could use CONVERT_FROM as
Tomer suggested.


Chris Matta
[email protected]
215-701-3146

On Sat, Nov 21, 2015 at 6:38 AM, Christopher Matta <[email protected]> wrote:

> Along the same lines I have a data set that's delimited by pipe with the
> last column further delimited by commas. It would be great to be able to
> flatten that last column.
>
>
> On Wednesday, October 7, 2015, Tugdual Grall <[email protected]> wrote:
>
>> Hi,
>>
>> Can you clarify how to use our own format in CONVERT_FROM?
>>
>> 1- is it supported out of the box ? (cannot find that in the doc -
>>
>> https://drill.apache.org/docs/supported-data-types/#data-types-for-convert_to-and-convert_from-functions
>> - may be not looking at the proper place.
>>
>> 2- it is another extension point? so developer can create their own parser
>> and deploy it?
>>
>>
>>
>> Thanks
>>
>> On Wed, Oct 7, 2015 at 4:55 PM, Tomer Shiran <[email protected]> wrote:
>>
>> > I think it would be much better to leverage something like
>> > CONVERT_FROM(col, 'MyKeyValueFormat') where the entire key/value cell is
>> > parsed into Drill's internal structure just like we do with JSON. That
>> > exposes all those key/value pairs such that you can leverage them in
>> other
>> > SQL operators.
>> >
>> > On Wed, Oct 7, 2015 at 7:32 AM, Tugdual Grall <[email protected]>
>> wrote:
>> >
>> >> You can create a custom function to do that:
>> >> - https://drill.apache.org/docs/develop-custom-functions/
>> >>
>> >> this will end with something like:
>> >>  select *
>> >>  from *.csv
>> >>  where GET_VALUE(col[x] , 'var1' ) = 'value'
>> >>
>> >> (I will publish, in the next few day, a similar function parsing URL in
>> >> values in this repo :
>> >> https://github.com/mapr-demos/simple-drill-functions )
>> >>
>> >> t
>> >>
>> >> On Wed, Oct 7, 2015 at 4:22 PM, <[email protected]> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > If I have a separated file (PSV/TSV etc) with a number of columns,
>> each
>> >> > with a single value aside from the last column which contains a
>> bunch of
>> >> > key value pairs in a string e.g. var1=test var2=test var3=test. Is it
>> >> > possible to query based on individual key/value items in the last
>> >> column? I
>> >> > am looking to querying nested data without using LIKE wildcards in
>> where
>> >> > statements (I am not sure this is possible with separated text data).
>> >> >
>> >> > Does anyone have any advice on whether this is possible?
>> >> >
>> >> > Thanks,
>> >> > Mike
>> >> >
>> >> >
>> >> > This e-mail (including any attachments) is private and confidential,
>> may
>> >> > contain proprietary or privileged information and is intended for the
>> >> named
>> >> > recipient(s) only. Unintended recipients are strictly prohibited from
>> >> > taking action on the basis of information in this e-mail and must
>> >> contact
>> >> > the sender immediately, delete this e-mail (and all attachments) and
>> >> > destroy any hard copies. Nomura will not accept responsibility or
>> >> liability
>> >> > for the accuracy or completeness of, or the presence of any virus or
>> >> > disabling code in, this e-mail. If verification is sought please
>> >> request a
>> >> > hard copy. Any reference to the terms of executed transactions
>> should be
>> >> > treated as preliminary only and subject to formal written
>> confirmation
>> >> by
>> >> > Nomura. Nomura reserves the right to retain, monitor and intercept
>> >> e-mail
>> >> > communications through its networks (subject to and in accordance
>> with
>> >> > applicable laws). No confidentiality or privilege is waived or lost
>> by
>> >> > Nomura by any mistransmission of this e-mail. Any reference to
>> "Nomura"
>> >> is
>> >> > a reference to any entity in the Nomura Holdings, Inc. group. Please
>> >> read
>> >> > our Electronic Communications Legal Notice which forms part of this
>> >> e-mail:
>> >> > http://www.Nomura.com/email_disclaimer.htm
>> >> >
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Tomer Shiran
>> > CEO and Co-Founder, Dremio
>> >
>>
>
>
> --
> Chris Matta
> [email protected]
> 215-701-3146
>
>

Re: Querying k/v pairs in a CSV column

Reply via email to