Sure, this is what I remember:

* Failure - embedded mode on my laptop
  - drill memory: 2GB/4GB (heap/direct)
  - cpu: 4 cores (+hyperthreading)
  - `planner.width.max_per_node=6`
* Success - AWS cluster, 2x c3.8xlarge
  - drill memory: 16GB/32GB
  - cpu: limited by kubernetes to 24 cores
  - `planner.width.max_per_node=23`

I'm too busy to test again right now, but I'll try to provide better info as soon as I can.
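In case it helps anyone reproduce this, the width setting can also be inspected and overridden from a Drill SQL session; a minimal sketch using the standard `sys.options`/`sys.memory` tables and the option name above (the values shown just mirror the two setups, pick whatever fits your hardware):

    -- Check how the planner width option is currently set
    SELECT * FROM sys.options WHERE name LIKE 'planner.width%';

    -- Check actual heap/direct memory per drillbit
    SELECT * FROM sys.memory;

    -- Per-session override (the laptop value above)
    ALTER SESSION SET `planner.width.max_per_node` = 6;

    -- Cluster-wide override (the AWS value above)
    ALTER SYSTEM SET `planner.width.max_per_node` = 23;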
On Wed, Aug 31, 2016 at 05:38:53PM +0530, Khurram Faraaz wrote:
Can you please share the number of cores on the setup where the query hung compared to the number of cores on the setup where the query went through successfully, along with the memory details for the two scenarios?

Thanks,
Khurram

On Wed, Aug 31, 2016 at 4:50 PM, Oscar Morante <[email protected]> wrote:

For the record, I think this was just bad memory configuration after all. I retested on bigger machines and everything seems to be working fine.

On Tue, Aug 09, 2016 at 10:46:33PM +0530, Khurram Faraaz wrote:

Oscar, can you please report a JIRA with the required steps to reproduce the OOM error? That way someone from the Drill team will take a look and investigate. For others interested, here is the stack trace.

2016-08-09 16:51:14,280 [285642de-ab37-de6e-a54c-378aaa4ce50e:foreman] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in Foreman.
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:2694) ~[na:1.7.0_111]
    at java.lang.String.<init>(String.java:203) ~[na:1.7.0_111]
    at java.lang.StringBuilder.toString(StringBuilder.java:405) ~[na:1.7.0_111]
    at org.apache.calcite.util.Util.newInternal(Util.java:785) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
    at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:251) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
    at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
    at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
    at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:404) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
    at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:343) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
    at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:240) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
    at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:290) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
    at org.apache.drill.exec.planner.sql.handlers.ExplainHandler.getPlan(ExplainHandler.java:61) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
    at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:94) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
    at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:978) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
    at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:257) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]

Thanks,
Khurram

On Tue, Aug 9, 2016 at 7:46 PM, Oscar Morante <[email protected]> wrote:

Yeah, when I uncomment only the `upload_date` lines (a dir0 alias), explain succeeds within ~30s. Enabling any of the other lines triggers the failure.
This is a log with the `upload_date` lines and `usage <> 'Test'` enabled:
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e

The client times out around here (~1.5 hours):
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

And it still keeps running for a while until it dies (~2.5 hours):
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

The memory settings for this test were:

    DRILL_HEAP="4G"
    DRILL_MAX_DIRECT_MEMORY="8G"

This is on a laptop with 16G and I should probably lower it, but it seems a bit excessive for such a small query. I think I got the same results on a 2-node cluster with 8/16; I'm gonna try again on the cluster to make sure.

Thanks,
Oscar

On Tue, Aug 09, 2016 at 04:13:17PM +0530, Khurram Faraaz wrote:

You mentioned "*But if I uncomment the where clause then it runs for a couple of hours until it runs out of memory.*" Can you please share the OutOfMemory details from drillbit.log and the value of DRILL_MAX_DIRECT_MEMORY? Can you also try to see what happens if you retain just this line in your where clause:

    where upload_date = '2016-08-01'

and check if the explain succeeds.

Thanks,
Khurram

On Tue, Aug 9, 2016 at 4:00 PM, Oscar Morante <[email protected]> wrote:

Hi there,

I've been stuck with this for a while and I'm not sure if I'm running into a bug or I'm just doing something very wrong. I have this stripped-down version of my query:
https://gist.github.com/spacepluk/9ab1e1a0cfec6f0efb298f023f4c805b

The data is just a single file with one record (1.5K). Without changing anything, explain takes ~1 sec on my machine. But if I uncomment the where clause then it runs for a couple of hours until it runs out of memory. Also, if I uncomment the where clause *and* take out the join, then it takes around 30s to plan.

Any ideas? Thanks!
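For readers without access to the gist, here is a hypothetical sketch of the query shape being discussed (a dir0 partition column aliased as `upload_date`, a `usage <> 'Test'` predicate, and a join whose removal made planning finish in ~30s). The table paths, join key, and column names below are placeholders, not the actual query:

    -- Hypothetical reconstruction only; the real query is in the gist above.
    EXPLAIN PLAN FOR
    SELECT e.`usage`, d.name
    FROM (
      SELECT dir0 AS upload_date,              -- the dir0 alias mentioned in the thread
             `usage`,
             dim_id
      FROM dfs.`/data/events`                  -- placeholder path
    ) e
    JOIN dfs.`/data/dims` d ON d.id = e.dim_id -- taking this join out: planning takes ~30s
    WHERE e.upload_date = '2016-08-01'         -- with only this predicate, explain succeeds
      AND e.`usage` <> 'Test';                 -- enabling the other predicates triggers the blow-up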
