ok, thanks for that tip. i will be using the local file system, as i'm
looking at embedding drill in an analytics program. the performance needn't
be brilliant for a while, but i will keep an eye on development and use
smaller files if needed.

On Thu, Jul 2, 2015 at 4:57 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> And one more note.  Your question didn't say whether you were running on
> HDFS or the local file system.  There is one weakness in the local file
> system capability: we don't split files on block boundaries.  We had a
> patch out for that, but it hasn't been merged.  If you're in that
> situation, you may want to split your data into smaller files manually
> until a patch like that gets merged.
>
> On Thu, Jul 2, 2015 at 1:48 PM, Jason Altekruse <altekruseja...@gmail.com>
> wrote:
>
> > Just one additional note here: I would strongly advise against
> > converting csv files to parquet with a plain select * query.
> >
> > The reason for this is two-fold. Currently we read csv files into a
> > list of varchars, rather than into individual columns. While parquet
> > supports lists and we will read them, the read path for complex data
> > has not been optimized as much as our read path for flat data out of
> > parquet. You will get the best performance by selecting data out of the
> > particular "columns" (we read the entire line into a single column
> > called `columns`, holding a list of varchars) in your csv file with our
> > array syntax and then assigning meaningful column names, for example:
> > select columns[0] as user_id, columns[1] as user_name, ... from
> > `foo.csv`. Additionally, for any columns with known types like int,
> > float, or datetime, I would recommend inserting casts where
> > appropriate. You will get better read performance reading fixed-width
> > data than reading a file full of varchars. On top of the read overhead
> > of storing the data as varchars, you would also be adding overhead
> > because your future queries would require a cast anyway to actually
> > analyze the data.
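> >
> > A minimal sketch of what a conversion along those lines could look like
> > (the column names user_id, user_name, and signup_date, their positions,
> > and their types are made up for illustration; the date cast assumes the
> > values are already in yyyy-MM-dd form):
> >
> > create table `foo.parquet` as
> > select cast(columns[0] as int)     as user_id,
> >        cast(columns[1] as varchar) as user_name,
> >        cast(columns[2] as date)    as signup_date
> > from `foo.csv`;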
> >
> >
> >
> > On Thu, Jul 2, 2015 at 1:27 PM, Larry White <ljw1...@gmail.com> wrote:
> >
> > > Great.  Thanks much
> > >
> > > On Thursday, July 2, 2015, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > >
> > > > Hey Larry,
> > > >
> > > > Drill transforms your CSV data into an internal memory-resident
> > > > format for processing, but does not change the structure of your
> > > > original data.
> > > >
> > > > If you want to convert your file to parquet, you can do this:
> > > >
> > > > create table `foo.parquet` as select * from `foo.csv`
> > > >
> > > >
> > > > This will, however, not leave you with interesting column names. You
> > > > can add names inside the select or by putting a parenthesized list of
> > > > fields after the word 'table'. Often you will want to add casts in the
> > > > select to indicate what type of data you want to use.
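> > > >
> > > > For instance, naming the fields with the parenthesized list could
> > > > look something like this (user_id and user_name are just placeholder
> > > > names for illustration):
> > > >
> > > > create table `foo.parquet` (user_id, user_name) as
> > > > select cast(columns[0] as int), columns[1] from `foo.csv`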
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Jul 2, 2015 at 12:48 PM, Larry White <ljw1...@gmail.com> wrote:
> > > >
> > > > > hi,
> > > > >
> > > > > i'm using drill to provide a query-able wrapper around some csv
> > > > > files. when i load a csv datasource, is the data transformed in
> > > > > some way (beyond what Calcite does) to improve performance?
> > > > > specifically, is it transformed into columnar format, re-written as
> > > > > parquet, or otherwise optimized?
> > > > >
> > > > > thanks.
> > > > >
> > > > > larry
> > > > >
> > > >
> > >
> >
>
