Hi Andries,
I am using Drill 1.1.0. Configuration is:
DRILL_MAX_DIRECT_MEMORY="4G"
DRILL_HEAP="1G"
planner.memory.max_query_memory_per_node is 4147483648
Physical RAM is 8G. The computer is dedicated to testing Drill (fresh Windows install).
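
For reference, the two DRILL_* variables above live in drill-env.sh, and the planner option was set through SQL, roughly like this (a sketch of my setup):

ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 4147483648;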
However, total.max peaks at 2 019 033 088 within Metrics.
During the initial query:
- 6 Minor Fragments for the PARQUET_WRITER are instantiated
- The query fails before writing starts
- 3 Minor Fragments for each of the 2 PARQUET_ROW_GROUP_SCAN operators: the first did not start and the second fails with peak memory at 125MB + 121MB + 21MB
Re-running the query after dropping planner.width.max_per_node from 3 to 1 (see the statement below) causes:
- only 1 minor fragment for each PARQUET_ROW_GROUP_SCAN operator
- both PARQUET_ROW_GROUP_SCAN operators to start (32K rows read out of 760K, and 760K out of 4840K)
- the query to fail with PARQUET_WRITER and HASH_JOIN initiated (Major Fragment 1)
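
For clarity, the width change for this run was applied at the session level with a statement like:

ALTER SESSION SET `planner.width.max_per_node` = 1;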
The total peak memory usage within the plan is:
- 57MB for PARQUET_WRITER
- 109MB for HASH_JOIN
- 170MB for PARQUET_ROW_GROUP_SCAN #1
- 360MB for PARQUET_ROW_GROUP_SCAN #2
- 25MB for a PROJECT operator
=> 721MB peak
Do you think my configuration is not appropriate for what I'm trying to do? Am I definitely limited by physical memory?
Thanks
Regards,
Boris


On Tuesday, 4 August 2015 at 17:10, Andries Engelbrecht
<[email protected]> wrote:
   

 How much memory is allocated to Drill in the drill-env.sh file?

CTAS with parquet can consume quite a bit of memory as various structures are 
allocated in memory before the parquet files are written. If you look in the 
query profiles you will get a good indication of the memory usage.

Also see how many fragments are working on creating the parquet files; if you
are limited on memory you can reduce the number of fragments in CTAS to limit
memory usage.
You can check planner.width.max_per_node and reduce the number if it is higher 
than 1. 
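
For example, the current value can be checked against the sys.options table with something along these lines (column names may vary slightly by Drill version):

SELECT name, num_val FROM sys.options WHERE name = 'planner.width.max_per_node';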

Which version of Drill are you using?

—Andries


> On Aug 4, 2015, at 7:50 AM, Boris Chmiel <[email protected]> 
> wrote:
> 
> Hi all,
> 
> I am trying to figure out how to optimize my queries. I found that when I prepare
> my data prior to querying it, using CTAS to apply a schema and transform my CSV
> files to Parquet format, subsequent queries are much more likely to reach OOM.
> 
> i.e.:
> 
> This direct query on CSV files works:
> 
> CREATE TABLE t3parquet AS (
> SELECT * FROM Table1.csv
> INNER JOIN Table2.csv ON table1.columns[0] = table2.columns[0]);
> 
> When this combination does not:
> 
> CREATE TABLE t1parquet AS (
> SELECT
> CAST(columns[0] AS varchar(10)) key1,
> CAST(columns[1] … and so on)
> FROM Table1.csv);
> 
> 
>  
> CREATE TABLE t2parquet AS (
> SELECT CAST(columns[0] AS varchar(10)) key1,
> CAST(columns[1] … and so on)
> FROM Table2.csv);
> 
> 
>  
> CREATE TABLE t3parquet AS (
> SELECT * FROM t2parquet
> INNER JOIN t1parquet ON t1parquet.key1 = t2parquet.key1);
> 
> 
>  
> This last query runs out of memory (OOM) on PARQUET_ROW_GROUP_SCAN.
> 
> 
>  
> I use embedded mode on Windows, file system storage, a 64MB parquet block
> size, and not-so-big files (less than a few hundred MB in raw format).
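> 
> The block size was set with something like the following (a sketch; 67108864 bytes = 64MB):
> 
> ALTER SYSTEM SET `store.parquet.block-size` = 67108864;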
> 
> 
>  
> 
>  
> Does the way Drill / Parquet works imply that queries / views on raw
> files should be preferred over Parquet to save memory? Is this behavior normal?
> 
> Do you think my memory configuration should be tuned, or am I
> misunderstanding something?
> 
> 
>  
> Thanks in advance, and sorry for my English.
> 
> Regards
> 
> Boris
> 
> 


  
