And one more note: your question didn't say whether you are running on HDFS or a local file system. There is one weakness in the local file system support: we don't split files on block boundaries. We had a patch out for that, but it hasn't been merged. If you are in that situation, you may want to split your data into smaller files manually until a patch like that is merged.
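Until such a patch is merged, splitting a large CSV by hand is straightforward. A sketch (file names and the chunk size are illustrative, and a tiny generated file stands in for a real large one) that keeps the header row in each piece:

```shell
# Stand-in for a real large file: 10 data rows plus a header.
printf 'id,name\n' > big.csv
for i in 1 2 3 4 5 6 7 8 9 10; do
  printf '%s,user%s\n' "$i" "$i" >> big.csv
done

head -n 1 big.csv > header.csv              # save the header row
tail -n +2 big.csv | split -l 4 - chunk_    # 4 data rows per chunk (pick a real size)

for f in chunk_*; do
  cat header.csv "$f" > "part_$f.csv"       # re-attach the header to each chunk
  rm "$f"
done
rm header.csv
```

With smaller files, Drill can assign one file per reader and parallelize the scan even on a local file system.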
On Thu, Jul 2, 2015 at 1:48 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:

> Just one additional note here: I would strongly advise against converting
> CSV files using a `select *` query.
>
> The reason for this is twofold. Currently we read CSV files into a list of
> varchars rather than into individual columns. While Parquet supports lists
> and we will read them, the read path for complex data has not been
> optimized as much as our read path for flat data out of Parquet. You will
> get the best performance by selecting data out of the particular "columns"
> (we read the entire line into a single column holding a list of varchars
> called `columns`) in your CSV file with our array syntax and then assigning
> meaningful column names, for example:
>
> select columns[0] as user_id, columns[1] as user_name, ... from `foo.csv`
>
> Additionally, for any columns with particular known types like int, float,
> datetime, etc., I would also recommend inserting casts where appropriate.
> You will get better read performance reading fixed-width data than reading
> a file full of varchars. On top of the read overhead of storing the data
> as varchars, you would also be adding overhead because your future queries
> would require a cast anyway to actually analyze the data.
>
> On Thu, Jul 2, 2015 at 1:27 PM, Larry White <ljw1...@gmail.com> wrote:
>
> > Great. Thanks much
> >
> > On Thursday, July 2, 2015, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> > > Hey Larry,
> > >
> > > Drill transforms your CSV data into an internal memory-resident format
> > > for processing, but does not change the structure of your original
> > > data.
> > >
> > > If you want to convert your file to Parquet, you can do this:
> > >
> > > create table `foo.parquet` as select * from `foo.csv`
> > >
> > > This will, however, not leave you with interesting column names.
> > > You can add names inside the select, or by putting a parenthesized
> > > list of fields after the word 'table'. Often you will want to add
> > > casts in the select to indicate what type of data you want to use.
> > >
> > > On Thu, Jul 2, 2015 at 12:48 PM, Larry White <ljw1...@gmail.com>
> > > wrote:
> > >
> > > > hi,
> > > >
> > > > i'm using drill to provide a query-able wrapper around some csv
> > > > files. when i load a csv datasource, is the data transformed in
> > > > some way (beyond what Calcite does) to improve performance?
> > > > Specifically, is it transformed into column format? re-written as
> > > > parquet, or otherwise optimized?
> > > >
> > > > thanks.
> > > >
> > > > larry
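Putting the advice in this thread together, a CTAS that names the columns and casts them to fixed-width types might look like the sketch below. The table path and column names are made up for illustration; `columns` is Drill's built-in array of varchars for each line of a CSV file.

```sql
-- Sketch only: paths, names, and types are illustrative.
create table dfs.tmp.`users_parquet` as
select
    cast(columns[0] as int)  as user_id,    -- fixed-width, faster to read back
    columns[1]               as user_name,  -- genuinely textual, left as varchar
    cast(columns[2] as date) as signup_date
from dfs.`/data/users.csv`;
```

Queries against the resulting Parquet table then read flat, typed columns and need no per-query casts.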