That setting is for off-heap memory. The earlier case hit the heap memory limit.
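For reference, a minimal sketch of the two knobs involved, with
illustrative values and assuming the sys.options layout of the 1.x
builds: the planner option can be inspected and changed from any SQL
client, while the heap itself is set through DRILL_HEAP in
conf/drill-env.sh and takes effect on drillbit restart.

    -- current planning limit (the 256 MB default, stored in bytes)
    SELECT name, num_val
    FROM   sys.options
    WHERE  name = 'planner.memory_limit';

    -- raise it cluster-wide to 512 MB
    ALTER SYSTEM SET `planner.memory_limit` = 536870912;

Given that the trace further down is a Java heap OOM thrown during
planning, raising DRILL_HEAP is the change more likely to help.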
> On Sep 1, 2016, at 11:36 AM, Zelaine Fong <[email protected]> wrote:
>
> One other thing ... have you tried tuning the planner.memory_limit
> parameter? Based on the earlier stack trace, you're hitting a memory
> limit during query planning, so tuning this parameter should help. The
> default is 256 MB.
>
> -- Zelaine
>
> On Thu, Sep 1, 2016 at 11:21 AM, rahul challapalli
> <[email protected]> wrote:
>
>> While planning we use heap memory, and 2GB of heap should be
>> sufficient for what you mentioned. This looks like a bug to me. Can
>> you raise a JIRA for it? It would also be super helpful if you could
>> attach the data set used.
>>
>> Rahul
>>
>> On Wed, Aug 31, 2016 at 9:14 AM, Oscar Morante <[email protected]>
>> wrote:
>>
>>> Sure,
>>> This is what I remember:
>>>
>>> * Failure
>>>   - embedded mode on my laptop
>>>   - drill memory: 2Gb/4Gb (heap/direct)
>>>   - cpu: 4 cores (+hyperthreading)
>>>   - `planner.width.max_per_node=6`
>>>
>>> * Success
>>>   - AWS cluster, 2x c3.8xlarge
>>>   - drill memory: 16Gb/32Gb
>>>   - cpu: limited by kubernetes to 24 cores
>>>   - `planner.width.max_per_node=23`
>>>
>>> I'm too busy right now to test again, but I'll try to provide better
>>> info as soon as I can.
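(Side note for readers of the archive: `planner.width.max_per_node` is
an ordinary session/system option, so the two configurations above can
be reproduced from any SQL client. A minimal sketch, using the laptop
value from the failure case:

    ALTER SESSION SET `planner.width.max_per_node` = 6;

It caps the number of minor fragments Drill runs per node, so it mainly
governs execution parallelism; it shouldn't change how much heap the
planner itself needs.)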
>>> On Wed, Aug 31, 2016 at 05:38:53PM +0530, Khurram Faraaz wrote:
>>>
>>>> Can you please share the number of cores on the setup where the
>>>> query hung, compared to the number of cores on the setup where the
>>>> query went through successfully? And the memory details from the
>>>> two scenarios.
>>>>
>>>> Thanks,
>>>> Khurram
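(Another aside: each drillbit reports its configured limits through the
sys tables, so, assuming the sys.memory layout of the 1.x builds, the
memory half of that comparison can be pulled from any SQL client:

    SELECT hostname, heap_max, direct_max
    FROM   sys.memory;

Core counts aren't exposed there as far as I know, so those still have
to come from the OS.)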
>>>> On Wed, Aug 31, 2016 at 4:50 PM, Oscar Morante <[email protected]>
>>>> wrote:
>>>>
>>>>> For the record, I think this was just bad memory configuration
>>>>> after all. I retested on bigger machines and everything seems to
>>>>> be working fine.
>>>>>
>>>>> On Tue, Aug 09, 2016 at 10:46:33PM +0530, Khurram Faraaz wrote:
>>>>>
>>>>>> Oscar, can you please report a JIRA with the required steps to
>>>>>> reproduce the OOM error? That way someone from the Drill team can
>>>>>> take a look and investigate.
>>>>>>
>>>>>> For others interested, here is the stack trace:
>>>>>>
>>>>>> 2016-08-09 16:51:14,280 [285642de-ab37-de6e-a54c-378aaa4ce50e:foreman]
>>>>>> ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure
>>>>>> Occurred, exiting. Information message: Unable to handle out of
>>>>>> memory condition in Foreman.
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>   at java.util.Arrays.copyOfRange(Arrays.java:2694) ~[na:1.7.0_111]
>>>>>>   at java.lang.String.<init>(String.java:203) ~[na:1.7.0_111]
>>>>>>   at java.lang.StringBuilder.toString(StringBuilder.java:405) ~[na:1.7.0_111]
>>>>>>   at org.apache.calcite.util.Util.newInternal(Util.java:785) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
>>>>>>   at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:251) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
>>>>>>   at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
>>>>>>   at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
>>>>>>   at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:404) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>>>>>>   at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:343) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>>>>>>   at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:240) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>>>>>>   at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:290) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>>>>>>   at org.apache.drill.exec.planner.sql.handlers.ExplainHandler.getPlan(ExplainHandler.java:61) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>>>>>>   at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:94) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>>>>>>   at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:978) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>>>>>>   at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:257) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
>>>>>>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
>>>>>>
>>>>>> Thanks,
>>>>>> Khurram
>>>>>>
>>>>>> On Tue, Aug 9, 2016 at 7:46 PM, Oscar Morante <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Yeah, when I uncomment only the `upload_date` lines (a dir0
>>>>>>> alias), explain succeeds within ~30s. Enabling any of the other
>>>>>>> lines triggers the failure.
>>>>>>>
>>>>>>> This is a log with the `upload_date` lines and `usage <> 'Test'`
>>>>>>> enabled:
>>>>>>> https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e
>>>>>>>
>>>>>>> The client times out around here (~1.5 hours):
>>>>>>> https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178
>>>>>>>
>>>>>>> And it still keeps running for a while until it dies (~2.5 hours):
>>>>>>> https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178
>>>>>>>
>>>>>>> The memory settings for this test were:
>>>>>>>
>>>>>>>     DRILL_HEAP="4G"
>>>>>>>     DRILL_MAX_DIRECT_MEMORY="8G"
>>>>>>>
>>>>>>> This is on a laptop with 16G, and I should probably lower it, but
>>>>>>> it seems a bit excessive for such a small query. I think I got
>>>>>>> the same results on a 2-node cluster with 8/16; I'm going to try
>>>>>>> again on the cluster to make sure.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Oscar
>>>>>>>
>>>>>>> On Tue, Aug 09, 2016 at 04:13:17PM +0530, Khurram Faraaz wrote:
>>>>>>>
>>>>>>>> You mentioned "*But if I uncomment the where clause then it runs
>>>>>>>> for a couple of hours until it runs out of memory.*"
>>>>>>>>
>>>>>>>> Can you please share the OutOfMemory details from drillbit.log
>>>>>>>> and the value of DRILL_MAX_DIRECT_MEMORY?
>>>>>>>>
>>>>>>>> Can you also try keeping just the line `upload_date =
>>>>>>>> '2016-08-01'` in your where clause, and check whether the
>>>>>>>> explain succeeds?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Khurram
>>>>>>>>
>>>>>>>> On Tue, Aug 9, 2016 at 4:00 PM, Oscar Morante
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> I've been stuck with this for a while and I'm not sure if I'm
>>>>>>>>> running into a bug or just doing something very wrong.
>>>>>>>>>
>>>>>>>>> I have this stripped-down version of my query:
>>>>>>>>> https://gist.github.com/spacepluk/9ab1e1a0cfec6f0efb298f023f4c805b
>>>>>>>>>
>>>>>>>>> The data is just a single file with one record (1.5K).
>>>>>>>>>
>>>>>>>>> Without changing anything, explain takes ~1 sec on my machine.
>>>>>>>>> But if I uncomment the where clause, then it runs for a couple
>>>>>>>>> of hours until it runs out of memory.
>>>>>>>>>
>>>>>>>>> Also, if I uncomment the where clause *and* take out the join,
>>>>>>>>> then it takes around 30s to plan.
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>> Thanks!
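PS for anyone landing here from the archives: the full query is in the
gist above, but the shape under discussion is roughly the following
(the path and every column except `upload_date` and `usage` are made up
for illustration):

    SELECT dir0 AS upload_date,
           t.`usage`,
           t.other_columns
    FROM   dfs.`/path/to/partitioned/data` t
    WHERE  dir0 = '2016-08-01'    -- partition filter on the dir0 alias
      AND  t.`usage` <> 'Test';   -- predicate that triggered the slow planning

It was this kind of where clause combined with a join that sent the
planner into the hours-long search described above.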
