Re: (noob) performance of queries against csv files
Hey Larry,

Drill transforms your CSV data into an internal memory-resident format for processing, but it does not change the structure of your original data.

If you want to convert your file to Parquet, you can do this:

create table `foo.parquet` as select * from `foo.csv`;

This will, however, not leave you with interesting column names. You can add names inside the select, or by putting a parenthesized list of fields after the word 'table'. Often you will also want to add casts in the select to indicate what type of data you want to use.

On Thu, Jul 2, 2015 at 12:48 PM, Larry White ljw1...@gmail.com wrote:

hi, I'm using Drill to provide a queryable wrapper around some CSV files. When I load a CSV data source, is the data transformed in some way (beyond what Calcite does) to improve performance? Specifically, is it transformed into column format, rewritten as Parquet, or otherwise optimized?

thanks.
larry
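As a sketch of the second option Ted mentions, naming the columns in a parenthesized list after the word 'table' rather than inside the select (the table and column names here are made-up examples, and this assumes a Drill version whose CTAS accepts a column list):

```sql
-- Hypothetical example: name the output columns in the CREATE TABLE clause
-- instead of aliasing them in the select.
create table `users_parquet` (user_id, username) as
select cast(columns[0] as int), columns[1]
from `users.csv`;
```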
I would recommend a select with specific column aliases assigned and casts where appropriate:

create table parquet_users as
select
  cast(columns[0] as int) as user_id,
  columns[1] as username,
  cast(columns[2] as timestamp) as registration_date
from `users.csv`;

On Thu, Jul 2, 2015 at 2:46 PM, Larry White ljw1...@gmail.com wrote:

so the solution is to use select, but with columns specifically defined. is that right?
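To double-check that the conversion actually produced typed columns rather than varchars, one option (assuming your Drill build includes the `typeof()` function; the table and column names follow the example above) is:

```sql
-- Inspect the materialized types in the new Parquet-backed table.
select typeof(user_id) as user_id_type,
       typeof(registration_date) as registration_date_type
from parquet_users
limit 1;
```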
so the solution is to use select, but with columns specifically defined. is that right?

On Thu, Jul 2, 2015 at 4:48 PM, Jason Altekruse altekruseja...@gmail.com wrote:

Just one additional note here: I would strongly advise against converting CSV files using a select * query out of a CSV. The reason for this is two-fold.

Currently we read CSV files into a single column (a list of varchars called `columns`), rather than into individual columns. While Parquet supports lists and we will read them, the read path for complex data has not been optimized as much as our read path for flat data out of Parquet. You will get the best performance by selecting the particular fields out of your CSV file with our array syntax and assigning meaningful column names, for example:

select columns[0] as user_id, columns[1] as user_name, ... from `foo.csv`;

Additionally, for any columns with particular known types like int, float, datetime, etc., I would also recommend that casts be inserted where appropriate. You will get better read performance reading fixed-width data than reading a file full of varchars. On top of the read overhead of storing the data as varchars, you would also be adding overhead to your future queries, which would require a cast anyway to actually analyze the data.

On Thu, Jul 2, 2015 at 1:27 PM, Larry White ljw1...@gmail.com wrote:

Great. Thanks much
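Once the data has been converted with explicit casts as Jason describes, downstream queries can use the typed columns directly, with no per-query casts. A minimal sketch against the `parquet_users` table from the earlier example:

```sql
-- user_id is already an int and registration_date a timestamp,
-- so comparisons and sorting work without casting on every query.
select username, registration_date
from parquet_users
where user_id > 1000
order by registration_date desc;
```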