Just as a bit of explanation for anyone who finds this thread: what is happening here is that the CSV parser reads a file containing no commas as a series of records with one value each. This method is a bit of a clever hack, but it will not work if any of your values contain commas. It is possible to configure the Drill delimited text reader to use a less likely character as the delimiter. This can be done in the web interface; more information is available at http://drill.apache.org/docs/storage-plugin-configuration/.
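For reference, a format entry along these lines (added to the "formats" section of the dfs storage plugin config in the web UI) would set the delimiter. The format name, file extension, and pipe delimiter below are just examples; pick a character you are certain never appears in your data:

```json
"formats": {
  "fixedwidth": {
    "type": "text",
    "extensions": ["txt"],
    "delimiter": "|"
  }
}
```

With a delimiter that never occurs in the file, each whole line reliably arrives in columns[0] and the substring approach below becomes robust.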
These files would be best converted to CSV or JSON by an external tool before processing in Drill. It would also be possible to write a new storage plugin to read the files in their current form.

One other small point: where I have seen fixed-width files before, the values are generally separated by spaces (as well as padded with extra spaces where the values are too short). You can see that this is likely the case with the example provided, given the lengths above (field1 = characters 1-12, field2 = characters 13-22, etc.), so the example should probably be written with the data as shown below, with appropriately adjusted bounds on the calls to the `right` function. Given the nature of the format, you would also likely need a call to the `trim` function before you try to cast the values to numeric types. Again, this is just a workaround, but if you configure the text reader to use a delimiter you are certain does not appear in the data, it should be a robust solution.

aaaaa bbbbbbbbbb 421.5 cc

- Jason

On Mon, Apr 20, 2015 at 4:37 PM, Abhishek Girish <[email protected]> wrote:

> Hey,
>
> I just tried out something. Not sure if there is a better way than this.
> See if this works for you:
>
> Assuming the text file contains the following line:
>
> $ cat /tmp/drill/abc.csv
> aaaaabbbbbbbbbb421.5cc
>
> (field1=aaaaa [chars 1-5], field2=bbbbbbbbbb [chars 6-15], field3=42
> [digits 16-17], field4=1.5 [digits 18-20], ...)
>
> A query similar to the one shown below can work:
>
> select
>   cast(columns[0] as char(5)),
>   `right`(cast(columns[0] as char(15)), 10),
>   cast(`right`(cast(columns[0] as char(17)), 2) as int),
>   cast(`right`(cast(columns[0] as char(20)), 3) as double)
> from dfs.tmp.`/drill/abc.csv`;
>
> +------------+------------+------------+------------+
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |
> +------------+------------+------------+------------+
> | aaaaa      | bbbbbbbbbb | 42         | 1.5        |
> +------------+------------+------------+------------+
>
> Regards,
> Abhishek
>
> On Mon, Apr 20, 2015 at 3:53 PM, Yousef Lasi <[email protected]> wrote:
>
> > Does anyone have any suggestions on querying fixed-length data files
> > with Drill? These are files that are received from a mainframe source
> > and the fields within a row are defined by length. For example,
> > field1 = characters 1-12, field2 = characters 13-22 etc.
> >
> > Thanks
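As a minimal sketch of the external-conversion route suggested above, here is a hypothetical Python script that slices each line by the field boundaries from the example (chars 1-5, 6-15, 16-17, 18-20, 21-22), trims the space padding (mirroring the `trim` call suggested before casting to numerics), and writes CSV that Drill can read directly. The field boundaries and file names are assumptions for illustration:

```python
# Sketch: convert a fixed-width file to CSV for loading into Drill.
# Field boundaries below follow the example record in this thread;
# adjust them to match your own layout.
import csv
import io

# (start, end) pairs, 1-based inclusive, as described in the thread.
FIELDS = [(1, 5), (6, 15), (16, 17), (18, 20), (21, 22)]

def split_fixed_width(line, fields=FIELDS):
    # Slice each field out of the line and strip the space padding.
    return [line[start - 1:end].strip() for start, end in fields]

def convert(src, dst):
    # Read fixed-width lines from src and write CSV rows to dst.
    writer = csv.writer(dst)
    for line in src:
        writer.writerow(split_fixed_width(line.rstrip("\n")))

record = "aaaaabbbbbbbbbb421.5cc"
print(split_fixed_width(record))  # → ['aaaaa', 'bbbbbbbbbb', '42', '1.5', 'cc']
```

Note that the slicing here reproduces the arithmetic in the query above: `right(cast(columns[0] as char(17)), 2)` is the last 2 characters of the first 17, i.e. characters 16-17.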
