Re: RE: Fixed-width files

2018-02-20 Thread Paul Rogers
Andries's solution is quite handy for occasional use. However, having a format 
plugin available can be more convenient and will perform better. When used with 
table functions, the format plugin lets you specify fields and column names 
per query, which helps if you find yourself querying several different files.
Maybe start with the simple approach and move to the custom approach if 
performance and convenience justify the extra work.
Thanks,
- Paul
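
To illustrate the table-function mechanism mentioned above, here is a sketch that uses Drill's built-in text format rather than a custom plugin. The file path is hypothetical, and the delimiter is an assumption: it only needs to be a character that never occurs in the data, so that each line arrives whole in columns[0]. A custom format plugin would presumably take its own parameters in the same position.

```sql
-- Sketch only: /data/fixed/test.dat is a made-up path, and '|' is assumed
-- never to appear in the file, so each line lands intact in columns[0].
SELECT
  SUBSTR(columns[0], 1, 4) AS field_a,  -- characters 1-4
  SUBSTR(columns[0], 5, 4) AS field_b,  -- characters 5-8
  SUBSTR(columns[0], 9, 4) AS field_c   -- characters 9-12
FROM table(dfs.root.`/data/fixed/test.dat`(type => 'text', fieldDelimiter => '|'));
```

Because the format options travel with the query, files with different layouts or extensions can each be read with their own settings, without changing the storage configuration.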

 


Re: Fixed-width files

2018-02-20 Thread Paul Rogers
Hi Flavio,
Great question! I've not yet experimented with the solution myself, but I 
believe that the plugin can be placed into a jar, along with the needed Drill 
config file, and then placed into the jars/3rd-party directory if you keep your 
config information in the Drill product directory. Perhaps Charles can offer 
more details based on his experience.
You may find it more convenient to use the "site" directory added in Drill 1.8. 
With a site directory, you separate your config files and custom jars from the 
Drill product files. Launch drill with the "--site" flag:
> drillbit.sh --site /my/site/dir start
For convenience, you can set the DRILL_SITE_DIR env var instead of using the 
--site flag.
If using a site directory, put your jar in the "jars" folder.
All that said, while you develop your plugin, you'll want to put the sources 
inside the Drill java-exec project. Why? Doing so allows you to very rapidly 
build and debug your library using your favorite IDE. The test file mentioned 
in the PR shows how to use the test framework to run a query, start an 
in-process Drillbit, and immediately step through (or set breakpoints in) your 
plugin code.
If you build the plugin as a jar file, then for each edit/compile/debug cycle, 
you'll need to build your jar, copy it to the proper location, restart the 
Drill server, attach the remote debugger, start a client tool, and finally 
submit a query. This works, but is quite slow; the above technique is faster 
for us impatient types...
Once the plugin works, you can move the code to a new project from 
which you can build and deploy your jar.
Or, you can do as Charles did: offer your plugin to the Drill project via a PR 
so others can use it.
Thanks,
- Paul

 

  

RE: Fixed-width files

2018-02-20 Thread Kunal Khatua
I agree... Using Andries' solution in combination with a view is probably the 
best approach.


Re: Fixed-width files

2018-02-20 Thread Flavio Pompermaier
Actually, what I'd like to achieve in the end is a way to record how a
fixed-width file should be read.
After considering all your opinions, the best way to achieve this is
probably to create a VIEW and then retrieve the column definitions through
a DESCRIBE query. What do you think?
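
A minimal sketch of that idea, assuming the file path from Andries's example elsewhere in this thread and a writable dfs.tmp workspace. The view name and column aliases are hypothetical, and the casts are there only so that DESCRIBE has concrete types to report.

```sql
-- Assumption: dfs.tmp is a writable workspace; the view name is made up.
CREATE OR REPLACE VIEW dfs.tmp.`fixed_width_test` AS
SELECT
  CAST(SUBSTR(columns[0], 1, 4) AS VARCHAR(4)) AS field_a,  -- characters 1-4
  CAST(SUBSTR(columns[0], 5, 4) AS VARCHAR(4)) AS field_b,  -- characters 5-8
  CAST(SUBSTR(columns[0], 9, 4) AS VARCHAR(4)) AS field_c   -- characters 9-12
FROM dfs.root.`/data/csv/test/test.csv`;

-- List the column names and types stored with the view.
DESCRIBE dfs.tmp.`fixed_width_test`;
```

The view keeps the offsets and widths in one place, so later queries can select field_a, field_b, and field_c by name.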


Re: Fixed-width files

2018-02-20 Thread Arjun kr

If you have the Hive storage plugin enabled, you can create a Hive table with 
the regex SerDe and query it in Drill.


-- Table contents

$ hadoop fs -cat /tmp/regex_test/*
112123
$

-- Hive DDL with regex '(.{1})(.{2})(.{3})': column1 of width 1, column2 of 
width 2, and column3 of width 3

CREATE EXTERNAL TABLE `hive_regex_test`(
  `column1` string COMMENT 'from deserializer',
  `column2` string COMMENT 'from deserializer',
  `column3` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='(.{1})(.{2})(.{3})')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/tmp/regex_test';

hive>
> select * from hive_regex_test;
OK
hive_regex_test.column1 hive_regex_test.column2 hive_regex_test.column3
1 12 123
Time taken: 0.235 seconds, Fetched: 1 row(s)
hive>


-- Drill

0: jdbc:drill:schema=dfs> select * from `hive_regex_test`;
+----------+----------+----------+
| column1  | column2  | column3  |
+----------+----------+----------+
| 1        | 12       | 123      |
+----------+----------+----------+
1 row selected (0.587 seconds)
0: jdbc:drill:schema=dfs>

Thanks,

Arjun
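
Since RegexSerDe exposes every column as a string, a follow-up query in Drill may want to cast the fields. The sketch below assumes the Hive storage plugin is registered under the name hive and that the table lives in Hive's default database. Neither is stated in the message above.

```sql
-- Assumptions: the storage plugin is named `hive`, and the table is in the
-- `default` database. Adjust both to match the actual setup.
SELECT
  CAST(column1 AS INT) AS column1,
  CAST(column2 AS INT) AS column2,
  CAST(column3 AS INT) AS column3
FROM hive.`default`.`hive_regex_test`;
```

The backticks around default are needed because it is a reserved word in Drill.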


  

RE: Fixed-width files

2018-02-20 Thread Kunal Khatua
This might be a better option, since DRILL-6170 will introduce a rigid parsing 
definition. So, different fixed-width files can't leverage the same definition, 
though they might share the same extension. 

Thanks, Andries!








Re: Fixed-width files

2018-02-20 Thread Andries Engelbrecht
You can also try and see if you can just use the CSV plugin to read a line as 
columns[0] and then use the substr function to pull out the fields in the line.
http://drill.apache.org/docs/string-manipulation/#substr

Here is a simple example

Simple csv file

[test]$ cat test.csv
col1col2col3


jdbc:drill:zk=localhost:5181> select substr(columns[0],1,4), 
substr(columns[0],5,4), substr(columns[0],9,4) from  
dfs.root.`/data/csv/test/test.csv`;
+---------+---------+---------+
| EXPR$0  | EXPR$1  | EXPR$2  |
+---------+---------+---------+
| col1    | col2    | col3    |
+---------+---------+---------+



--Andries
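
Real fixed-width data is usually padded, so the extracted fields often need trimming and casting. The following is only a sketch under assumed conditions: the file path and the layout (a 10-character name, a 5-character quantity, an 8-character code) are hypothetical. The approach also relies on the lines containing no commas, since the CSV reader would otherwise split them across several columns entries.

```sql
-- Sketch only: hypothetical file and layout, not taken from the thread.
--   characters  1-10  name      (space-padded text)
--   characters 11-15  quantity  (space-padded integer)
--   characters 16-23  code      (text)
SELECT
  TRIM(SUBSTR(columns[0], 1, 10))              AS name,
  CAST(TRIM(SUBSTR(columns[0], 11, 5)) AS INT) AS quantity,
  SUBSTR(columns[0], 16, 8)                    AS code
FROM dfs.root.`/data/csv/test/padded.csv`;
```

As in the example above, SUBSTR takes a 1-based start position followed by a length.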








Re: RE: Fixed-width files

2018-02-20 Thread Flavio Pompermaier
For the moment I've created an improvement issue about this:
https://issues.apache.org/jira/browse/DRILL-6170



Re: RE: Fixed-width files

2018-02-20 Thread Flavio Pompermaier
Thanks Paul for this suggestion, I think I'm going to give it a try.
Once I've created my EasyFormatPlugin where should I put the produced jar?
in which folder within jars directory?



Re: RE: Fixed-width files

2018-02-19 Thread Paul Rogers
It may be that by "fixed width text", Flavio means a file in which the text 
columns are of fixed width: kind of like old-school punch cards.
Drill has no reader for this use case, but if you are a Java programmer, you 
can create one. See Drill Pull Request #1114 [1] for one example of a regex 
reader along with pointers to a second example I'm building for a book. It should 
be easy to adapt this code to take a list of column widths in place of the 
regex. Actually, you could use the regex with a pattern that just picks out a 
fixed number of characters.
Thanks,
- Paul

[1]  https://github.com/apache/drill/pull/1114


 

  

Re: Fixed-width files

2018-02-19 Thread Flavio Pompermaier
Do you have any real example of this (apart from the one reported at [1])?

[1] https://drill.apache.org/docs/text-files-csv-tsv-psv/




-- 
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 041809


RE: Fixed-width files

2018-02-19 Thread Kunal Khatua
As long as you have delimiters, you should be able to import it as a regular 
CSV file. Using views that define the fixed-width nature should help operators 
downstream work more efficiently. 



Fixed-width files

2018-02-19 Thread Flavio Pompermaier
Hi to all,
I'm currently looking for the best solution to load a fixed-width text file
into Drill.
Is there any way to do that right now? Does anyone already have a
working connector?
Is it better to implement a brand new FormatPluginConfig or
StoragePluginConfig?

Best,
Flavio