Sylvain Zimmer created SPARK-16826:
--------------------------------------

             Summary: java.util.Hashtable limits the throughput of PARSE_URL()
                 Key: SPARK-16826
                 URL: https://issues.apache.org/jira/browse/SPARK-16826
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Sylvain Zimmer


Hello!

I'm using {c4.8xlarge} instances on EC2 with 36 cores and doing lots of 
{parse_url(url, "host")} in Spark SQL.

Unfortunately, it seems that java.net.URL consults an internal synchronized 
cache (a java.util.Hashtable of stream handlers) every time a URL is 
constructed, and the instances end up being 90% idle.
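
As a rough illustration of why this lookup serializes threads, the sketch 
below uses reflection to check whether a map's get(Object) method is declared 
synchronized. The class name LockCheck and the method getIsSynchronized are 
made up for this example; they are not part of Spark or the JDK.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Hashtable;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration only; not part of Spark or the JDK.
final class LockCheck {
    // Returns true if the public get(Object) method of the given class is
    // declared synchronized, meaning every caller must take the object's
    // monitor before reading.
    static boolean getIsSynchronized(Class<?> mapClass) {
        try {
            Method get = mapClass.getMethod("get", Object.class);
            return Modifier.isSynchronized(get.getModifiers());
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(mapClass + " has no get(Object)", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Hashtable.get synchronized: "
                + getIsSynchronized(Hashtable.class));
        System.out.println("ConcurrentHashMap.get synchronized: "
                + getIsSynchronized(ConcurrentHashMap.class));
    }
}
```

With a single shared Hashtable, all 36 task threads in one JVM compete for the 
same monitor on every read, which would be consistent with the BLOCKED threads 
in the dump.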

When I view the thread dump for my executors, most of the 36 task threads are 
in the "BLOCKED" state, with this stack trace:
{code}
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.<init>(URL.java:599)
java.net.URL.<init>(URL.java:490)
java.net.URL.<init>(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
{code}

However, when I switch from 1 executor with 36 cores to 9 executors with 4 
cores each, throughput is almost 10x higher and the CPUs are back at ~100% use.
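
Another possible way around the lock within a single JVM, sketched here under 
assumptions, is to extract the host with java.net.URI, which only parses the 
string and does not look up a stream handler. HostParser and hostOf are 
hypothetical names for this example, and this is not how ParseUrl is 
implemented. Note that java.net.URI is stricter than java.net.URL, so some 
inputs that currently parse (for example, hosts containing underscores) may 
yield null instead.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not part of Spark.
final class HostParser {
    // Returns the host component of the given URL string, or null if the
    // input is null, cannot be parsed as a URI, or has no host component.
    static String hostOf(String url) {
        if (url == null) {
            return null;
        }
        try {
            // java.net.URI parses the string without the protocol-handler
            // lookup that java.net.URL performs in its constructor.
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://spark.apache.org/docs/latest/"));
    }
}
```

Such a function could be registered as a UDF in place of 
parse_url(url, "host"), at the cost of the stricter parsing described above.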

Thanks!



