Hi all,

I am trying to figure out how to optimize my queries. I found that when I prepare my
data before querying it, using CTAS to apply a schema and transform my CSV files to
Parquet format, the subsequent queries are much more likely to hit OOM.

For example, this direct query on the CSV files works:

CREATE TABLE t3parquet AS (
  SELECT *
  FROM `Table1.csv` table1
  INNER JOIN `Table2.csv` table2
    ON table1.columns[0] = table2.columns[0]
);

While this combination does not:

CREATE TABLE t1parquet AS (
  SELECT
    CAST(columns[0] AS varchar(10)) key1,
    CAST(columns[1] … and so on
  FROM `Table1.csv`
);


 
CREATE TABLE t2parquet AS (
  SELECT
    CAST(columns[0] AS varchar(10)) key1,
    CAST(columns[1] … and so on
  FROM `Table2.csv`
);


 
CREATE TABLE t3parquet AS (
  SELECT *
  FROM t2parquet
  INNER JOIN t1parquet
    ON t1parquet.key1 = t2parquet.key1
);


 
This last query runs out of memory (OOM) in PARQUET_ROW_GROUP_SCAN.


 
I use Drill in embedded mode on Windows, with file system storage, a 64 MB Parquet
block size, and not particularly big files (less than a few hundred MB in raw format).
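
For reference, I set the block size with a session option before running the CTAS
statements, roughly like this (64 MB expressed in bytes; please tell me if this is
not the right knob):

ALTER SESSION SET `store.parquet.block-size` = 67108864;  -- 64 * 1024 * 1024 bytes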


 

 
Does the way Drill / Parquet works mean that, to save memory, queries/views on the
raw files should be preferred over Parquet? Is this behavior normal?

Do you think my memory configuration should be tuned, or am I misunderstanding
something?
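
The only memory-related option I have looked at so far is the per-node query memory
limit, something like the following (the value shown is what I understand to be the
default, so I am not sure whether raising it is the right approach):

ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 2147483648;  -- 2 GB, the default as far as I understand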


 
Thanks in advance, and sorry for my English.

Regards

Boris

