Just one additional note here: I would strongly advise against converting
CSV files to Parquet with a plain select * query.

The reason for this is two-fold. First, we currently read CSV files into a
single list of varchars rather than into individual columns. Parquet supports
lists and we will read them, but our read path for complex data out of Parquet
has not been optimized as much as our read path for flat data. You will get
the best performance by selecting the particular "columns" out of your CSV
file with our array syntax (we read each entire line into a single column
called `columns`, which holds a list of varchars) and then assigning
meaningful column names, for example select columns[0] as user_id,
columns[1] as user_name, ... from `foo.csv`.

Second, for any columns with known types like int, float, datetime, etc., I
would also recommend inserting casts where appropriate. You will get better
read performance on fixed-width data than on a file full of varchars, and on
top of that read overhead, your future queries would require a cast anyway to
actually analyze the data, adding further overhead.
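
Putting both suggestions together, a CTAS along the following lines should do
it. This is just a sketch: user_id, user_name, and signup_date are
hypothetical column names, the types are whatever actually matches your data,
and the date cast assumes the third field is in ISO yyyy-MM-dd format.

create table `foo.parquet` as
select
  cast(columns[0] as int) as user_id,
  columns[1] as user_name,
  cast(columns[2] as date) as signup_date
from `foo.csv`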



On Thu, Jul 2, 2015 at 1:27 PM, Larry White <ljw1...@gmail.com> wrote:

> Great.  Thanks much
>
> On Thursday, July 2, 2015, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Hey Larry,
> >
> > Drill transforms your CSV data into an internal memory-resident format
> > for processing, but does not change the structure of your original data.
> >
> > If you want to convert your file to parquet, you can do this:
> >
> > create table `foo.parquet` as select * from `foo.csv`
> >
> >
> > This will, however, not leave you with interesting column names.  You can
> > add names inside the select or by putting a parenthesized list of fields
> > after the word 'table'.  Often you will want to add casts in the select
> > to indicate what type of data you want to use.
> >
> >
> >
> >
> > > On Thu, Jul 2, 2015 at 12:48 PM, Larry White <ljw1...@gmail.com> wrote:
> >
> > > hi,
> > >
> > > i'm using drill to provide a query-able wrapper around some csv files.
> > > when i load a csv datasource, is the data transformed in some way
> > > (beyond what Calcite does) to improve performance?  Specifically, is it
> > > transformed into column format, re-written as parquet, or otherwise
> > > optimized?
> > >
> > > thanks.
> > >
> > > larry
> > >
> >
>
