Sure, this is what I remember:

* Failure
   - embedded mode on my laptop
   - Drill memory: 2GB/4GB (heap/direct)
   - CPU: 4 cores (+hyperthreading)
   - `planner.width.max_per_node=6`

* Success
   - AWS Cluster 2x c3.8xlarge
   - Drill memory: 16GB/32GB (heap/direct)
   - CPU: limited by Kubernetes to 24 cores
   - `planner.width.max_per_node=23`

I'm too busy right now to test again, but I'll try to provide better info as soon as I can.
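
In case it helps anyone compare setups the same way, the values listed above can be read back from a running Drillbit instead of from memory. This is only a minimal sketch: it assumes the `sys.memory` and `sys.options` system tables are available in your Drill version, and the value 6 is just the laptop setting from the list above, not a recommendation.

```sql
-- Heap and direct memory as seen by each Drillbit
SELECT * FROM sys.memory;

-- Current value of the per-node parallelization width
SELECT * FROM sys.options WHERE name = 'planner.width.max_per_node';

-- Change the width for the current session only
-- (6 is the value used in the failing laptop setup above)
ALTER SESSION SET `planner.width.max_per_node` = 6;
```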


On Wed, Aug 31, 2016 at 05:38:53PM +0530, Khurram Faraaz wrote:
Can you please share the number of cores on the setup where the query hung,
compared to the number of cores on the setup where the query went through
successfully? Please also share the memory details for both scenarios.

Thanks,
Khurram

On Wed, Aug 31, 2016 at 4:50 PM, Oscar Morante <[email protected]> wrote:

For the record, I think this was just bad memory configuration after all.
I retested on bigger machines and everything seems to be working fine.


On Tue, Aug 09, 2016 at 10:46:33PM +0530, Khurram Faraaz wrote:

Oscar, can you please file a JIRA with the steps required to reproduce the
OOM error? That way someone from the Drill team can take a look and
investigate.

For others interested, here is the stack trace.

2016-08-09 16:51:14,280 [285642de-ab37-de6e-a54c-378aaa4ce50e:foreman] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in Foreman.
java.lang.OutOfMemoryError: Java heap space
       at java.util.Arrays.copyOfRange(Arrays.java:2694) ~[na:1.7.0_111]
       at java.lang.String.<init>(String.java:203) ~[na:1.7.0_111]
       at java.lang.StringBuilder.toString(StringBuilder.java:405) ~[na:1.7.0_111]
       at org.apache.calcite.util.Util.newInternal(Util.java:785) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
       at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:251) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
       at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
       at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
       at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:404) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
       at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:343) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
       at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:240) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
       at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:290) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
       at org.apache.drill.exec.planner.sql.handlers.ExplainHandler.getPlan(ExplainHandler.java:61) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
       at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:94) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
       at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:978) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
       at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:257) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
       at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]

Thanks,
Khurram

On Tue, Aug 9, 2016 at 7:46 PM, Oscar Morante <[email protected]> wrote:

Yeah, when I uncomment only the `upload_date` lines (a dir0 alias),
explain succeeds within ~30s.  Enabling any of the other lines triggers
the failure.

This is a log with the `upload_date` lines and `usage <> 'Test'` enabled:
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e

The client times out around here (~1.5 hours):
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

And it still keeps running for a while until it dies (~2.5 hours):
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

The memory settings for this test were:

   DRILL_HEAP="4G"
   DRILL_MAX_DIRECT_MEMORY="8G"

This is on a laptop with 16G and I should probably lower it, but it seems
a bit excessive for such a small query.  And I think I got the same
results on a 2 node cluster with 8/16.  I'm gonna try again on the cluster
to make sure.

Thanks,
Oscar


On Tue, Aug 09, 2016 at 04:13:17PM +0530, Khurram Faraaz wrote:

You mentioned "*But if I uncomment the where clause then it runs for a
couple of hours until it runs out of memory.*"

Can you please share the OutOfMemory details from drillbit.log and the
value of DRILL_MAX_DIRECT_MEMORY?

Can you also check whether the explain succeeds if you retain just the
line upload_date = '2016-08-01' in your where clause?

Thanks,
Khurram

On Tue, Aug 9, 2016 at 4:00 PM, Oscar Morante <[email protected]> wrote:

Hi there,

I've been stuck with this for a while and I'm not sure if I'm running
into a bug or I'm just doing something very wrong.

I have this stripped-down version of my query:
https://gist.github.com/spacepluk/9ab1e1a0cfec6f0efb298f023f4c805b

The data is just a single file with one record (1.5K).

Without changing anything, explain takes ~1 sec on my machine.  But if I
uncomment the where clause then it runs for a couple of hours until it
runs out of memory.

Also if I uncomment the where clause *and* take out the join, then it
takes around 30s to plan.

Any ideas?
Thanks!
