Re: (noob) performance of queries against csv files

2015-07-02 Thread Ted Dunning
Hey Larry,

Drill transforms your CSV data into an internal memory-resident format for
processing, but does not change the structure of your original data.

If you want to convert your file to parquet, you can do this:

create table `foo.parquet` as select * from `foo.csv`


This will, however, not leave you with interesting column names.  You can
add names inside the select or by putting a parenthesized list of fields
after the word 'table'.  Often you will want to add casts in the select to
indicate what type of data you want to use.
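
As a sketch of both approaches, with hypothetical column names (`user_id`, `name`) — Drill exposes each csv line as an array column called `columns`, as discussed further down the thread:

```sql
-- Naming and typing columns inside the select:
create table `foo.parquet` as
select cast(columns[0] as int) as user_id,
       columns[1] as name
from `foo.csv`;

-- Or naming them in a parenthesized list after the table name:
create table `foo.parquet` (user_id, name) as
select cast(columns[0] as int), columns[1]
from `foo.csv`;
```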




On Thu, Jul 2, 2015 at 12:48 PM, Larry White ljw1...@gmail.com wrote:

 hi,

 i'm using drill to provide a query-able wrapper around some csv files. when
 i load a csv datasource, is the data transformed in some way (beyond what
 Calcite does) to improve performance?  Specifically, is it transformed into
 a column format, rewritten as parquet, or otherwise optimized?

 thanks.

 larry



Re: (noob) performance of queries against csv files

2015-07-02 Thread Jason Altekruse
I would recommend a select with specific column aliases assigned, and casts
where appropriate.

create table parquet_users as select cast(columns[0] as int) as user_id,
columns[1] as username, cast(columns[2] as timestamp) as registration_date
from `users.csv`;
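
Assuming a conversion along those lines succeeded, later queries can then use the named, typed columns directly (table and column names as in the example above):

```sql
select user_id, username
from parquet_users
where registration_date >= timestamp '2015-01-01 00:00:00';
```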


On Thu, Jul 2, 2015 at 2:46 PM, Larry White ljw1...@gmail.com wrote:

 so the solution is to use select, but with columns specifically defined. is
 that right?

 On Thu, Jul 2, 2015 at 4:48 PM, Jason Altekruse altekruseja...@gmail.com
 wrote:

  Just one additional note here: I would strongly advise against converting
  csv files using a select * query out of a csv.

  The reason for this is two-fold. Currently we read csv files into a list
  of varchars, rather than individual columns (the entire line goes into a
  single column called `columns`). While parquet supports lists and we will
  read them, the read path for complex data has not been optimized as much
  as our read path for flat data out of parquet. You will get the best
  performance by selecting the particular columns in your csv file with our
  array syntax and assigning meaningful column names, for example select
  columns[0] as user_id, columns[1] as user_name, ... from `foo.csv`.
  Additionally, for any columns with known types like int, float, or
  datetime, I would recommend inserting casts where appropriate. You will
  get better read performance reading fixed-width data than a file full of
  varchars. On top of the read overhead of storing data as varchars, your
  future queries would require a cast anyway to actually
  analyze the data.
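
To illustrate that last point with a hypothetical `age` column (table names here are made up): if the data is written to parquet as varchars, every later query repeats the cast, whereas casting once at conversion time keeps downstream queries both simpler and cheaper.

```sql
-- If `age` was stored as a varchar, every query pays for the cast:
select avg(cast(age as int)) from `users_varchar_parquet`;

-- If `age` was cast to int at conversion time, queries stay simple:
select avg(age) from `users_typed_parquet`;
```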


Re: (noob) performance of queries against csv files

2015-07-02 Thread Larry White
so the solution is to use select, but with columns specifically defined. is
that right?
