Hi Andries, thanks for your support.
Bumping the heap or disabling hash_join alone does not cause the query to complete. Drill embedded for Windows behaves in a strange way (at least on my machine):
- When I run Drill alone, the query fails with OOM, even with more than 5GB of free memory showing in the task monitor.
- But when I stress I/O & CPU with a background video encoding process during the query, it finally completes.
I will try to run Drill embedded on Linux on the same machine and also on my MacBook, to find out if I encounter the same limitation and to determine whether it is due to my machine or the OS.
Boris


On Tuesday, August 4, 2015 at 6:45 PM, Andries Engelbrecht <[email protected]> wrote:

 See if you can bump your heap to 2G.
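A minimal sketch of what that could look like in drill-env.sh, using the same variables you already have set (values are just the suggestion above, not a definitive tuning):

DRILL_HEAP="2G"
DRILL_MAX_DIRECT_MEMORY="4G"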

I have not worked with Windows embedded, but there should be no reason why setting the node width to 1 would not work. I have used it on Linux and Mac with no issue to create parquet.

Also decrease the parquet block size to 32MB on such a small system if you keep 
running into memory issues. You can also turn off HASH JOIN to limit memory 
usage (planner.enable_hashjoin), but it may slow down the query. HASH JOIN and 
Parquet writer can consume a lot of memory.
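For reference, a sketch of both settings as session options (32MB is 33554432 bytes; option names as of Drill 1.1):

ALTER SESSION SET `store.parquet.block-size` = 33554432;
ALTER SESSION SET `planner.enable_hashjoin` = false;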

On the last query, which columns do you want? A select * on a join may not always be ideal with CTAS. Without knowing the data it is hard to tell what the desired outcome is.
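For example, projecting only the columns you need instead of * (the column names here are hypothetical, just to illustrate):

CREATE TABLE t3parquet AS (
SELECT t1.key1, t1.col_a, t2.col_b
FROM t1parquet t1
INNER JOIN t2parquet t2 ON t1.key1 = t2.key1);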

—Andries


> On Aug 4, 2015, at 9:05 AM, Boris Chmiel <[email protected]> 
> wrote:
> 
> Hi Andries,
> I am using Drill 1.1.0. The configuration is:
> DRILL_MAX_DIRECT_MEMORY="4G"
> DRILL_HEAP="1G"
> planner.memory.max_query_memory_per_node is 4147483648
> Physical RAM is 8G. The computer is dedicated to testing Drill (fresh Windows install).
> However, total.max peaks at 2 019 033 088 within Metrics.
> During the initial query:
> - 6 Minor Fragments in the PARQUET_WRITER are instantiated
> - The query fails before starting to write
> - 3 Minor Fragments for each of the 2 PARQUET_ROW_GROUP_SCAN: the first did not start and the second fails with peak memory at 125MB + 121MB + 21MB
> Re-running the query after dropping planner.width.max_per_node from 3 to 1 causes:
> - only 1 minor fragment for each PARQUET_ROW_GROUP_SCAN operator
> - both PARQUET_ROW_GROUP_SCAN operators to start (32K rows read out of 760K, and 760K out of 4840K)
> - the query to fail with PARQUET_WRITER and HASH_JOIN initiated (Major Fragment 1)
> The total peak memory usage within the plan is:
> - 57MB for PARQUET_WRITER
> - 109MB for HASH_JOIN
> - 170MB for PARQUET_ROW_GROUP_SCAN #1
> - 360MB for PARQUET_ROW_GROUP_SCAN #2
> - 25MB for a PROJECT operator
> => 721MB peak
> Do you think my configuration is not appropriate for what I'm trying to do? Am I definitely limited by physical memory?
> Thanks
> Regards,
> Boris
> 
> 
> On Tuesday, August 4, 2015 at 5:10 PM, Andries Engelbrecht <[email protected]> wrote:
> 
> 
> How much memory is allocated to Drill in the drill-env.sh file?
> 
> CTAS with parquet can consume quite a bit of memory as various structures are 
> allocated in memory before the parquet files are written. If you look in the 
> query profiles you will get a good indication of the memory usage.
> 
> Also see how many fragments are working on creating the parquet files; if you are limited on memory you can reduce the number of fragments in CTAS to limit memory usage.
> You can check planner.width.max_per_node and reduce the number if it is higher than 1, as in the sketch below.
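> A minimal sketch, checking the current value first and then lowering it for the session:
> 
> SELECT name, num_val FROM sys.options WHERE name = 'planner.width.max_per_node';
> ALTER SESSION SET `planner.width.max_per_node` = 1;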
> 
> Which version of Drill are you using?
> 
> —Andries
> 
> 
>> On Aug 4, 2015, at 7:50 AM, Boris Chmiel <[email protected]> 
>> wrote:
>> 
>> Hi all,
>> 
>> I am trying to figure out how to optimize my queries. I found that when I prepare my data prior to querying it, using CTAS to apply a schema and transform my CSV files to Parquet format, subsequent queries are much more likely to reach OOM.
>> 
>> For example:
>> 
>> This direct query on csv files works:
>> 
>> CREATE TABLE t3parquet AS (
>> SELECT * FROM `Table1.csv` table1
>> INNER JOIN `Table2.csv` table2 ON table1.columns[0] = table2.columns[0]);
>> 
>> When this combination does not:
>> 
>> CREATE TABLE t1parquet AS (
>> SELECT
>> CAST(columns[0] AS varchar(10)) key1,
>> CAST(columns[1] … and so on)
>> FROM `Table1.csv`);
>> 
>> CREATE TABLE t2parquet AS (
>> SELECT
>> CAST(columns[0] AS varchar(10)) key1,
>> CAST(columns[1] … and so on)
>> FROM `Table2.csv`);
>> 
>> CREATE TABLE t3parquet AS (
>> SELECT * FROM t2parquet
>> INNER JOIN t1parquet ON t1parquet.key1 = t2parquet.key1);
>> 
>> This last query runs OOM on PARQUET_ROW_GROUP_SCAN.
>> 
>> I use embedded mode on Windows, file system storage, 64MB parquet block size, and not-so-big files (less than a few hundred MB in raw format).
>> 
>> Does the way Drill / Parquet works imply that queries/views on raw files should be preferred over parquet to save memory? Is this behavior normal?
>> 
>> Do you think my memory configuration should be tuned, or am I misunderstanding something?
>> 
>> Thanks in advance, and sorry for my English.
>> 
>> Regards
>> 
>> Boris
>> 
>> 
> 
> 


  
