And one more note: your question didn't say whether you are running on HDFS or a local file system. There is one weakness in the local file system support: we don't split files on block boundaries. We had a patch out for that, but it hasn't been merged. If you are in that situation, you may want to split your data into smaller files manually until a patch like that is merged.
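Until such a patch is merged, splitting a large CSV by hand is straightforward. A sketch (file names and the chunk size are illustrative, and a tiny generated file stands in for a real large one) that keeps the header row in each piece:

```shell
# Stand-in for a real large file: 10 data rows plus a header.
printf 'id,name\n' > big.csv
for i in 1 2 3 4 5 6 7 8 9 10; do
  printf '%s,user%s\n' "$i" "$i" >> big.csv
done

head -n 1 big.csv > header.csv              # save the header row
tail -n +2 big.csv | split -l 4 - chunk_    # 4 data rows per chunk (pick a real size)

for f in chunk_*; do
  cat header.csv "$f" > "part_$f.csv"       # re-attach the header to each chunk
  rm "$f"
done
rm header.csv
```

With smaller files, Drill can assign one file per reader and parallelize the scan even on a local file system.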
On Thu, Jul 2, 2015 at 1:48 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:

> Just one additional note here: I would strongly advise against converting
> CSV files using a `select *` query.
>
> The reason for this is twofold. Currently we read CSV files into a list of
> varchars rather than into individual columns. While Parquet supports lists
> and we will read them, the read path for complex data has not been
> optimized as much as our read path for flat data out of Parquet. You will
> get the best performance by selecting data out of the particular "columns"
> (we read the entire line into a single column holding a list of varchars
> called `columns`) in your CSV file with our array syntax and then assigning
> meaningful column names, for example:
>
> select columns[0] as user_id, columns[1] as user_name, ... from `foo.csv`
>
> Additionally, for any columns with particular known types like int, float,
> datetime, etc., I would also recommend inserting casts where appropriate.
> You will get better read performance reading fixed-width data than reading
> a file full of varchars. On top of the read overhead of storing the data
> as varchars, you would also be adding overhead because your future queries
> would require a cast anyway to actually analyze the data.
>
> On Thu, Jul 2, 2015 at 1:27 PM, Larry White <ljw1...@gmail.com> wrote:
>
> > Great. Thanks much
> >
> > On Thursday, July 2, 2015, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> > > Hey Larry,
> > >
> > > Drill transforms your CSV data into an internal memory-resident format
> > > for processing, but does not change the structure of your original
> > > data.
> > >
> > > If you want to convert your file to Parquet, you can do this:
> > >
> > > create table `foo.parquet` as select * from `foo.csv`
> > >
> > > This will, however, not leave you with interesting column names.
> > > You can add names inside the select, or by putting a parenthesized
> > > list of fields after the word 'table'. Often you will want to add
> > > casts in the select to indicate what type of data you want to use.
> > >
> > > On Thu, Jul 2, 2015 at 12:48 PM, Larry White <ljw1...@gmail.com>
> > > wrote:
> > >
> > > > hi,
> > > >
> > > > i'm using drill to provide a query-able wrapper around some csv
> > > > files. when i load a csv datasource, is the data transformed in
> > > > some way (beyond what Calcite does) to improve performance?
> > > > Specifically, is it transformed into column format? re-written as
> > > > parquet, or otherwise optimized?
> > > >
> > > > thanks.
> > > >
> > > > larry
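Putting the advice in this thread together, a CTAS that names the columns and casts them to fixed-width types might look like the sketch below. The table path and column names are made up for illustration; `columns` is Drill's built-in array of varchars for each line of a CSV file.

```sql
-- Sketch only: paths, names, and types are illustrative.
create table dfs.tmp.`users_parquet` as
select
    cast(columns[0] as int)  as user_id,    -- fixed-width, faster to read back
    columns[1]               as user_name,  -- genuinely textual, left as varchar
    cast(columns[2] as date) as signup_date
from dfs.`/data/users.csv`;
```

Queries against the resulting Parquet table then read flat, typed columns and need no per-query casts.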