OK, thanks for that tip. I will be using the local file system, since I'm looking at embedding Drill in an analytics program. The performance needn't be brilliant for a while, but I will keep an eye on development and use smaller files if needed.
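For the record, based on the advice below, the conversion I have in mind looks roughly like this (the dfs.tmp workspace, the file path, and the column names and types are illustrative guesses for my own data, not anything from the thread):

  create table dfs.tmp.`foo.parquet` as
  select
      cast(columns[0] as int)    as user_id,   -- hypothetical columns and types
      columns[1]                 as user_name,
      cast(columns[2] as double) as score
  from dfs.`/data/foo.csv`;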
On Thu, Jul 2, 2015 at 4:57 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> And one more note. Your question didn't say whether you were running on
> HDFS or the local file system. There is one weakness in the local file
> system capability: we don't split files on block boundaries. We had a
> patch out for that, but it hasn't been merged. If you're currently in
> that situation, you may want to split your data into smaller files
> manually until a patch like that gets merged.
>
> On Thu, Jul 2, 2015 at 1:48 PM, Jason Altekruse <altekruseja...@gmail.com>
> wrote:
>
> > Just one additional note here: I would strongly advise against
> > converting csv files using a select * query.
> >
> > The reason for this is two-fold. Currently we read csv files into a
> > list of varchars, rather than into individual columns. While parquet
> > supports lists and we will read them, the read path for complex data
> > has not been optimized as much as our read path for flat data out of
> > parquet. You will get the best performance by selecting data out of
> > the particular "columns" in your csv file (we read the entire line
> > into a single column holding a list of varchars, called `columns`)
> > with our array syntax and then assigning meaningful column names, for
> > example:
> >
> >   select columns[0] as user_id, columns[1] as user_name, ...
> >   from `foo.csv`
> >
> > Additionally, for any columns with known types like int, float,
> > datetime, etc., I would recommend inserting casts where appropriate.
> > You will get better read performance reading fixed-width data than
> > reading a file full of varchars. On top of the read overhead of
> > storing the data as varchars, you would also be adding overhead
> > because your future queries would require a cast anyway to actually
> > analyze the data.
> >
> > On Thu, Jul 2, 2015 at 1:27 PM, Larry White <ljw1...@gmail.com> wrote:
> >
> > > Great. Thanks much.
> > >
> > > On Thursday, July 2, 2015, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > >
> > > > Hey Larry,
> > > >
> > > > Drill transforms your CSV data into an internal memory-resident
> > > > format for processing, but does not change the structure of your
> > > > original data.
> > > >
> > > > If you want to convert your file to parquet, you can do this:
> > > >
> > > >   create table `foo.parquet` as select * from `foo.csv`
> > > >
> > > > This will, however, not leave you with interesting column names.
> > > > You can add names inside the select or by putting a parenthesized
> > > > list of fields after the word 'table'. Often you will want to add
> > > > casts in the select to indicate what type of data you want to use.
> > > >
> > > > On Thu, Jul 2, 2015 at 12:48 PM, Larry White <ljw1...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm using Drill to provide a queryable wrapper around some csv
> > > > > files. When I load a csv data source, is the data transformed in
> > > > > some way (beyond what Calcite does) to improve performance?
> > > > > Specifically, is it transformed into columnar format, rewritten
> > > > > as parquet, or otherwise optimized?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Larry
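P.S. If I read Ted's note about the parenthesized field list correctly, the name-assigning variant would be something like the sketch below (again with hypothetical names and types; I haven't run this yet):

  create table dfs.tmp.`foo.parquet` (user_id, user_name) as  -- names come from the list after the table name
  select cast(columns[0] as int), columns[1]
  from dfs.`/data/foo.csv`;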