Hi all,

I am trying to figure out how to optimize my queries. I found that when I prepare my data before querying it, using CTAS to apply a schema and convert my CSV files to Parquet format, subsequent queries are much more likely to run out of memory (OOM).
For example, this direct query on the CSV files works:

    CREATE TABLE t3parquet AS (
      SELECT * FROM Table1.csv
      INNER JOIN Table2.csv ON table1.columns[0] = table2.columns[0]);

whereas this combination does not:

    CREATE TABLE t1parquet AS (
      SELECT CAST(columns[0] AS varchar(10)) key1, CAST(columns[1] …and so on)
      FROM Table1.csv);

    CREATE TABLE t2parquet AS (
      SELECT CAST(columns[0] AS varchar(10)) key1, CAST(columns[1] …and so on)
      FROM Table2.csv);

    CREATE TABLE t3parquet AS (
      SELECT * FROM t2parquet
      INNER JOIN t1parquet ON t1parquet.key1 = t2parquet.key1);

This last query runs out of memory in PARQUET_ROW_GROUP_SCAN.

I use embedded mode on Windows, file system storage, a 64 MB Parquet block size, and fairly small files (a few hundred MB at most in raw format).

Does the way Drill and Parquet work mean that queries or views on raw files should be preferred over Parquet to save memory? Is this behavior normal? Do you think my memory configuration should be tuned, or am I misunderstanding something?

Thanks in advance, and sorry for my English.

Regards,
Boris
