[jira] [Commented] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406868#comment-15406868 ]

Apache Spark commented on SPARK-16826:
--------------------------------------

User 'sylvinus' has created a pull request for this issue:
https://github.com/apache/spark/pull/14488

> java.util.Hashtable limits the throughput of PARSE_URL()
> ---------------------------------------------------------
>
>                 Key: SPARK-16826
>                 URL: https://issues.apache.org/jira/browse/SPARK-16826
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Sylvain Zimmer
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there,
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.<init>(URL.java:599)
> java.net.URL.<init>(URL.java:490)
> java.net.URL.<init>(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
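The hot path in that thread dump can be exercised outside Spark. Below is a minimal standalone sketch (class name, thread count, and iteration count are all arbitrary choices for illustration): every `new URL(...)` call funnels through `URL.getURLStreamHandler`, which consults a JVM-wide synchronized `java.util.Hashtable`, so all worker threads in a single JVM contend on one lock.

```java
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class UrlHotPath {
    // Sums host-name lengths across `threads` workers, each constructing
    // `iters` java.net.URL objects. Every constructor call goes through
    // URL.getURLStreamHandler and its synchronized Hashtable, so the
    // workers serialize on that lock just like the executor threads do.
    static long run(int threads, int iters) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong total = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < iters; i++) {
                    try {
                        total.addAndGet(new URL("http://example.com/" + i).getHost().length());
                    } catch (Exception e) {
                        throw new RuntimeException(e); // unreachable for this input
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 4 workers x 10000 URLs x "example.com".length() == 440000
        System.out.println(run(4, 10_000));
    }
}
```

Taking a jstack dump while this runs with a larger thread count should show the same `Hashtable.get` / `getURLStreamHandler` frames as the executor thread dump above.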
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403262#comment-15403262 ]

Sylvain Zimmer commented on SPARK-16826:
----------------------------------------

[~srowen] what about this?
https://github.com/sylvinus/spark/commit/98119a08368b1cd1faf3f25a32910ad6717c5c02

The tests seem to pass, and I don't think it uses the problematic code paths in java.net.URL (except for getFile, but that could probably be fixed easily).
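The commit linked above isn't reproduced here, but an implementation that avoids java.net.URL for HOST extraction would look roughly like the following. This is a hypothetical sketch, not the actual patch; {{host}} is an illustrative helper name. The key property is that {{java.net.URI}} parses eagerly in pure Java, with no URLStreamHandler lookup and no shared Hashtable, so concurrent callers don't serialize.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ParseHost {
    // Hypothetical sketch of extracting the HOST part the way
    // parse_url(url, "HOST") does, but via java.net.URI instead of
    // java.net.URL.
    static String host(String url) {
        try {
            // getHost() returns null when the input has no host component
            // (e.g. "mailto:" URIs), which also maps to SQL NULL.
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // mirror parse_url, which yields NULL on bad input
        }
    }

    public static void main(String[] args) {
        System.out.println(host("https://spark.apache.org/docs/latest/")); // spark.apache.org
    }
}
```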
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403236#comment-15403236 ]

Sylvain Zimmer commented on SPARK-16826:
----------------------------------------

Sorry I can't be more helpful on the Java side... But I think there must be some high-quality URL parsing code somewhere in the Apache foundation already :-)
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403222#comment-15403222 ]

Sean Owen commented on SPARK-16826:
-----------------------------------

URI.toURL just follows the same code path. Does URI itself parse all the same fields? I didn't think so, because URIs are a superset of URLs. Definitely open to suggestions; anything that can parse the same fields respectably is OK.
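For what it's worth, {{java.net.URI}} does expose accessors for most of the components PARSE_URL extracts. A quick check (the sample URL below is arbitrary):

```java
import java.net.URI;

public class UriFields {
    public static void main(String[] args) throws Exception {
        URI u = new URI("http://user@example.com:8080/path/to/page?q=1#frag");
        System.out.println(u.getScheme());    // "http"                  -> PROTOCOL
        System.out.println(u.getHost());      // "example.com"           -> HOST
        System.out.println(u.getRawPath());   // "/path/to/page"         -> PATH
        System.out.println(u.getRawQuery());  // "q=1"                   -> QUERY
        System.out.println(u.getFragment());  // "frag"                  -> REF
        System.out.println(u.getUserInfo());  // "user"                  -> USERINFO
        System.out.println(u.getAuthority()); // "user@example.com:8080" -> AUTHORITY
    }
}
```

One gap: the FILE component (path plus query) has no single URI accessor and would need to be concatenated from getRawPath() and getRawQuery().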
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403165#comment-15403165 ]

Sylvain Zimmer commented on SPARK-16826:
----------------------------------------

[~srowen] thanks for the pointers! I'm parsing every hyperlink found in Common Crawl, so there are billions of unique ones; no way around it.

Wouldn't it be possible to switch to another implementation with an API similar to java.net.URL? As I understand it, we never need the URLStreamHandler in the first place anyway. I'm not a Java expert, but what about {{java.net.URI}} or {{org.apache.catalina.util.URL}}, for instance?
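One caveat with a straight swap to {{java.net.URI}}: it enforces RFC 2396 syntax and rejects some inputs that java.net.URL tolerates, which matters when parsing billions of scraped hyperlinks. A small illustration of the difference (an unencoded space is just one example):

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UriStrictness {
    // Returns true if java.net.URI accepts the string as-is.
    static boolean uriAccepts(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String messy = "http://example.com/a b";
        // java.net.URL does not validate path characters, so the
        // unencoded space parses fine...
        System.out.println(new URL(messy).getHost()); // example.com
        // ...but java.net.URI rejects it with URISyntaxException.
        System.out.println(uriAccepts(messy));        // false
    }
}
```

So an URI-based implementation would likely return NULL for some messy real-world links that the URL-based one currently parses.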
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402110#comment-15402110 ]

Sean Owen commented on SPARK-16826:
-----------------------------------

The contention is in the JDK class java.net.URL itself, and it does look like an unfortunate bottleneck from ages ago. There's a static, globally synchronized Hashtable here, which explains why multiple JVMs (executors) alleviate the problem. We can't fix that directly. See also http://www.supermind.org/blog/580/java-net-url-synchronization-bottleneck

We also probably have to rely on the behavior of this class to parse a URL; I don't see a good alternative here. You can avoid the lookup if you pass in the URLStreamHandler manually, which would mean reimplementing some of the same logic in URL. That then means the overhead of looking up a SecurityManager, though.

What about just avoiding a load of calls to parseURL at once? Is that at all reasonable, e.g. rearranging the pipeline to filter or remove duplicates earlier upstream?
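The manual-handler workaround mentioned above can be sketched as follows (a sketch under the stated assumptions, not a recommendation). Passing a handler to the three-argument URL constructor skips the synchronized getURLStreamHandler lookup; the inherited protected parseURL in URLStreamHandler still performs standard generic-URL parsing, and a SecurityManager check applies only when one is installed.

```java
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class ManualHandler {
    // A handler used only for its inherited parseURL logic; we never open
    // a connection, so openConnection just refuses.
    static final URLStreamHandler PARSE_ONLY = new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) {
            throw new UnsupportedOperationException("parsing only");
        }
    };

    public static void main(String[] args) throws Exception {
        // Three-arg constructor: no Hashtable lookup for the "http" scheme.
        URL u = new URL(null, "http://example.com/a?b=c", PARSE_ONLY);
        System.out.println(u.getHost()); // example.com
        System.out.println(u.getFile()); // /a?b=c
    }
}
```

This keeps java.net.URL's parsing semantics (including getFile) while avoiding the shared Hashtable, at the cost of bypassing scheme-specific handlers.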