Hi Andries, thanks for your support
Bumping the heap or disabling hash_join alone does not cause the query to
complete. Drill embedded on Windows behaves in a strange way (at least on my
machine):
- When I run Drill alone, the query fails with OOM, even with more than 5GB of
free memory showing in the task monitor.
- But when I stress I/O & CPU with a background video encoding process during
the query, it finally completes.
I will try to run Drill embedded on Linux on the same machine, and also on my
MacBook, to find out whether I hit the same limitation and to determine whether
it is due to my machine or the OS.
Boris
On Tuesday, August 4, 2015 at 6:45 PM, Andries Engelbrecht
<[email protected]> wrote:
See if you can bump your heap to 2G.
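For example, a rough sketch of the relevant drill-env.sh lines (the values are
only illustrative for an 8GB machine, and the Windows embedded scripts may read
memory settings differently):
DRILL_HEAP="2G"
DRILL_MAX_DIRECT_MEMORY="4G"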
I have not worked with Windows embedded, but there should be no reason why
setting the node width to 1 would not work. I have used it on Linux and Mac
with no issues to create Parquet.
Also decrease the parquet block size to 32MB on such a small system if you keep
running into memory issues. You can also turn off HASH JOIN to limit memory
usage (planner.enable_hashjoin), but it may slow down the query. HASH JOIN and
Parquet writer can consume a lot of memory.
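For example, something along these lines at the session level (32MB =
33554432 bytes):
ALTER SESSION SET `store.parquet.block-size` = 33554432;
ALTER SESSION SET `planner.enable_hashjoin` = false;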
On the last query, which columns do you actually want? A SELECT * on a join
may not always be ideal with CTAS. Without knowing the data it is hard to
follow what the desired outcome is.
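For instance, something like this, keeping only the columns of interest
(some_col and other_col are just placeholder column names):
CREATE TABLE t3parquet AS (
SELECT t1.key1, t1.some_col, t2.other_col
FROM t1parquet t1
INNER JOIN t2parquet t2 ON t1.key1 = t2.key1);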
—Andries
> On Aug 4, 2015, at 9:05 AM, Boris Chmiel <[email protected]>
> wrote:
>
> Hi Andries,
> I am using Drill 1.1.0. The configuration is:
> DRILL_MAX_DIRECT_MEMORY="4G"
> DRILL_HEAP="1G"
> planner.memory.max_query_memory_per_node is 4147483648
> Physical RAM is 8G. The computer is dedicated to testing Drill (fresh Windows
> install).
> However, the total.max metric peaks at 2,019,033,088 in Metrics.
> During the initial query:
> - 6 minor fragments in the PARQUET_WRITER are instantiated
> - The query fails before writing starts
> - 3 minor fragments for each of the 2 PARQUET_ROW_GROUP_SCAN operators: the
> first did not start, and the second fails with peak memory at 125MB + 121MB
> + 21MB
> Re-running the query after dropping planner.width.max_per_node from 3 to 1
> causes:
> - only 1 minor fragment for each PARQUET_ROW_GROUP_SCAN operator
> - both PARQUET_ROW_GROUP_SCAN operators to start (32K rows read out of 760K,
> and 760K out of 4840K)
> - the query to fail with PARQUET_WRITER and HASH_JOIN initiated (Major
> Fragment 1)
> The total peak memory usage within the plan is:
> - 57MB for PARQUET_WRITER
> - 109MB for HASH_JOIN
> - 170MB for PARQUET_ROW_GROUP_SCAN #1
> - 360MB for PARQUET_ROW_GROUP_SCAN #2
> - 25MB for a PROJECT operator
> => 721MB peak
> Do you think my configuration is not appropriate for what I'm trying to do?
> Am I simply limited by physical memory?
> Thanks
> Regards,
> Boris
>
>
> On Tuesday, August 4, 2015 at 5:10 PM, Andries Engelbrecht
> <[email protected]> wrote:
>
>
> How much memory is allocated to Drill in the drill-env.sh file?
>
> CTAS with parquet can consume quite a bit of memory as various structures are
> allocated in memory before the parquet files are written. If you look in the
> query profiles you will get a good indication of the memory usage.
>
> Also see how many fragments are working on creating the Parquet files; if you
> are limited on memory you can reduce the number of fragments in CTAS to limit
> memory usage.
> You can check planner.width.max_per_node and reduce the number if it is
> higher than 1.
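> For example, something like:
> ALTER SESSION SET `planner.width.max_per_node` = 1;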
>
> Which version of Drill are you using?
>
> —Andries
>
>
>> On Aug 4, 2015, at 7:50 AM, Boris Chmiel <[email protected]>
>> wrote:
>>
>> Hi all,
>>
>> I am trying to figure out how to optimize my queries. I found that when I
>> prepare my data prior to querying it, using CTAS to apply a schema and
>> transform my CSV files to Parquet format, subsequent queries are much more
>> likely to reach OOM.
>>
>> For example:
>>
>> This direct query on CSV files works:
>>
>> CREATE TABLE t3parquet AS (
>> SELECT * FROM Table1.csv
>> INNER JOIN Table2.csv ON table1.columns[0] = table2.columns[0]);
>>
>> While this combination does not:
>>
>> CREATE TABLE t1parquet AS (
>> SELECT
>> CAST(columns[0] AS varchar(10)) key1,
>> CAST(columns[1] …and so on)
>> FROM Table1.csv);
>>
>>
>>
>> CREATE TABLE t2parquet AS (
>> SELECT CAST(columns[0] AS varchar(10)) key1,
>> CAST(columns[1] …and so on)
>> FROM Table2.csv);
>>
>>
>>
>> CREATE TABLE t3parquet AS (
>> SELECT * FROM t2parquet
>> INNER JOIN t1parquet ON t1parquet.key1 = t2parquet.key1);
>>
>>
>>
>> This last query runs out of memory (OOM) on PARQUET_ROW_GROUP_SCAN.
>>
>>
>>
>> I use embedded mode on Windows, file system storage, a 64MB Parquet block
>> size, and not-so-big files (less than a few hundred MB in raw format).
>>
>>
>>
>>
>>
>> Does the way Drill / Parquet works imply that queries/views on the raw files
>> should be preferred over Parquet to save memory? Is this behavior normal?
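>> To illustrate, by a view on the raw files I mean something like this sketch
>> (the dfs.tmp workspace and the single CAST column are just placeholders):
>> CREATE VIEW dfs.tmp.t1view AS
>> SELECT CAST(columns[0] AS varchar(10)) key1
>> FROM Table1.csv;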
>>
>> Do you think my memory configuration should be tuned, or am I
>> misunderstanding something?
>>
>>
>>
>> Thanks in advance, and sorry for my English.
>>
>> Regards
>>
>> Boris
>>
>>
>
>