Re: Querying Delimited Sequence file

Robin Moffatt Tue, 30 Aug 2016 08:59:49 -0700

Hi,

Thanks - I think SPLIT gets me some of the way, but after the FLATTEN I
want to PIVOT, so instead of :


0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> select
flatten(split(version,'.')) from sys.version;
+---------+
| EXPR$0  |
+---------+
| 1       |
| 7       |
| 0       |
+---------+

I'd get something like:

+---------+---------+---------+
| EXPR$0  | EXPR$1  | EXPR$2  |
+---------+---------+---------+
| 1       | 7       | 0       |
+---------+---------+---------+

I'm guessing this isn't possible in Drill yet?

Also, what would be be the syntax to enter the \x01 character in the SPLIT
function? Entered literally I get an error:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT
split(CONVERT_FROM(binary_value, 'UTF8'),'\x01') from
 `/user/oracle/seq/pdb.soe.logon` limit 5;
Error: SYSTEM ERROR: IllegalArgumentException: Only single character
delimiters are supported for split()

BTW I didn't realise SPLIT was supported, and it's not listed in
https://drill.apache.org/docs/string-manipulation/ or
https://drill.apache.org/search/?q=split -- is there somewhere I should log
this kind of documentation issue?

thanks, Robin.


On 30 August 2016 at 16:07, Zelaine Fong <[email protected]> wrote:

> If the column is delimited by some character, you can use the SPLIT()
> function to separate the value into an array of values.  You can then use
> the FLATTEN() function to separate the array of values into individual
> records.
>
> E.g., if your column has the value "a:b", where your delimiter is ":", you
> would run the following query:
>
> 0: jdbc:drill:zk=local> select flatten(split(columns[0],':')) from
> `/tmp/foo.csv`;
> +---------+
> | EXPR$0  |
> +---------+
> | a       |
> | b       |
> +---------+
> 2 rows selected (0.319 seconds)
>
> Is that what you had in mind?
>
> -- Zelaine
>
> On Tue, Aug 30, 2016 at 7:17 AM, Robin Moffatt <
> [email protected]> wrote:
>
> > Hi,
> >
> > I'm trying to read a sequence file, in which the key is null and the
> value
> > holds multiple columns [1], delimited by \x01. In Hive I simply define it
> > as :
> >
> > CREATE EXTERNAL TABLE foo (col1 string, col2 string, col3 timestamp)
> > ROW FORMAT DELIMITED
> > STORED as sequencefile
> > LOCATION '/user/oracle/foo/bar';
> >
> > In Drill I've got as far as
> >
> > SELECT CONVERT_FROM(binary_value, 'UTF8') from  `/user/oracle/foo/bar`
> >
> > which yields the data but as a single column. I can cast it to individual
> > columns but this is no use if the field positions change
> >
> > SELECT substr(CONVERT_FROM(binary_value, 'UTF8'),5,1) as
> > col0,substr(CONVERT_FROM(binary_value, 'UTF8'),7,13) as
> > col1,substr(CONVERT_FROM(binary_value, 'UTF8'),20,20) as col3 from
> >  `/user/oracle/seq/pdb.soe.logon` limit 5;
> > +-------+----------------+-----------------------+
> > | col0  |      col1      |         col3          |
> > +-------+----------------+-----------------------+
> > | I     | PDB.SOE.LOGON  | 2016-07-29 13:36:40  |
> >
> >
> > Is there a way to treat a column as delimited and burst it out into
> > multiple columns? Presumably I could somehow dump the string contents to
> > CSV and then re-read it - but I'm interested here in using Drill the
> query
> > existing data; wrangling it to suit Drill isn't really what I'm looking
> for
> > (and maybe Drill just isn't the right tool here?).
> >
> >
> > thanks,
> >
> > Robin.
> >
> > [1]
> > https://docs.oracle.com/goldengate/bd1221/gg-bd/GADBD/
> > GUID-85A82B2E-CD51-463A-8674-3D686C3C0EC0.htm#GADBD-GUID-
> > 4CAFC347-0F7D-49AB-B293-EFBCE95B66D6
> >
>

Re: Querying Delimited Sequence file

Reply via email to