
-----Original Message-----
From: Wes Holler 
Sent: Friday, April 29, 2016 6:09 PM
To: dev
Subject: RPC Timeout and Abnormally Long JvmGcTime

Recently we switched to EMR 4.5/Spark 1.6.1 and have since encountered a new 
failure scenario.

The primary symptom is that the cluster appears to be stalled. The job has not 
failed but will not proceed and has to be killed. One or more 
RpcTimeoutExceptions (see below) are usually found towards the bottom of the 
log. Another observation, via the status API, is that the jvmGcTime value 
(see JSON snippet below) for some tasks in a stage that never finished is 
abnormally large (in the millions). Even for identical queries/data, this 
problem doesn't always happen, and when the cluster doesn't hang, I also do 
not see these large values for jvmGcTime.

Some questions:
1. What units is jvmGcTime in?
2. If it is in milliseconds, that seems to indicate garbage collection times 
into the tens of minutes, far longer than the queries themselves usually 
take and also longer than the default spark.rpc.askTimeout value of 120s. Could 
this be blocking/starving the promise and causing the failure?
3. We are doing almost entirely standard SQL, no UDFs, and no use of the RDD 
API. I was under the impression that in Spark 1.6 pretty much everything in 
DataFrames uses the custom Tungsten serialization and unsafe buffers, even 
in the actual processing (so no actual unroll). Is this correct? What could be 
causing such GC churn?
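For scale, assuming jvmGcTime is indeed reported in milliseconds (question 1 above), the sample value from the status API works out to roughly 27 minutes of GC, far beyond the 120-second RPC timeout. A quick sanity check of the arithmetic:

```python
# Sanity-check the scale of the observed jvmGcTime, assuming the value
# is in milliseconds (this is the assumption question 2 hinges on).
JVM_GC_TIME_MS = 1637765        # from the status API sample below
RPC_ASK_TIMEOUT_S = 120         # default spark.rpc.askTimeout

gc_seconds = JVM_GC_TIME_MS / 1000
gc_minutes = gc_seconds / 60

print(f"GC time: {gc_minutes:.1f} min")                      # ~27.3 min
print(f"vs. timeout: {gc_seconds / RPC_ASK_TIMEOUT_S:.1f}x") # ~13.6x
```

If that assumption holds, a single task spent over a dozen RPC-timeout intervals inside GC, which would make the ContextCleaner timeouts below unsurprising.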

===============================================================================

The exception found in the log (several stack trace lines removed for brevity):

16/04/29 17:43:29 ERROR ContextCleaner: Error cleaning broadcast 0
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 
seconds]. This timeout is controlled by spark.rpc.askTimeout
        at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
        .
        at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 
seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        .
        ... 12 more
16/04/29 17:43:29 WARN BlockManagerMaster: Failed to remove broadcast 0 with 
removeFromMaster = true - Cannot receive any reply in 120 seconds. This timeout 
is controlled by spark.rpc.askTimeout
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
seconds. This timeout is controlled by spark.rpc.askTimeout
        at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
        .
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 
120 seconds
        at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242)
        ... 7 more
16/04/29 17:43:29 INFO ContextCleaner: Cleaned accumulator 2
16/04/29 19:33:23 INFO PackagesResourceConfig: Scanning for root resource and 
provider classes in the packages:
  org.apache.spark.status.api.v1
16/04/29 19:33:28 INFO ScanningResourceConfig: Root resource classes found:
  class org.apache.spark.status.api.v1.ApiRootResource
16/04/29 19:33:28 INFO ScanningResourceConfig: Provider classes found:
  class org.apache.spark.status.api.v1.JacksonMessageWriter
16/04/29 19:33:28 INFO WebApplicationImpl: Initiating Jersey application, 
version 'Jersey: 1.9 09/02/2011 11:17 AM'
16/04/29 19:34:09 WARN Errors: The following warnings have been detected with 
resource and/or provider classes:
  WARNING: A sub-resource method, public scala.collection.Seq 
org.apache.spark.status.api.v1.OneStageResource.stageData(int), with URI 
template, "", is treated as a resource method

===============================================================================

A sample from the JSON scraped from the status API endpoint:

"1446406" : {
      "taskId" : 1446406,
      "index" : 593,
      "attempt" : 0,
      "launchTime" : "2016-04-27T11:16:25.005GMT",
      "executorId" : "364",
      "host" : "ip-10-0-0-41.ec2.internal",
      "taskLocality" : "RACK_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "taskMetrics" : {
        "executorDeserializeTime" : 0,
        "executorRunTime" : 0,
        "resultSize" : 0,
        "jvmGcTime" : 1637765,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "inputMetrics" : {
          "bytesRead" : 40824341,
          "recordsRead" : 1130000
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      }
    },



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

