Re: Fetch Failure
I've eliminated the fetch failures with these parameters (I don't know which one was the right one for the problem) passed to spark-submit, running with 1.2.0:

--conf spark.shuffle.compress=false \
--conf spark.file.transferTo=false \
--conf spark.shuffle.manager=hash \
--conf spark.akka.frameSize=50 \
--conf spark.core.connection.ack.wait.timeout=600

...but like you, I'm unable to finish a job: now I'm facing OOMs. Still trying, but at least the fetch failures are gone.

Bye

On 23/12/2014 21:10, Chen Song wrote:
I tried both 1.1.1 and 1.2.0 (built against cdh5.1.0 and hadoop2.3) but I am still seeing FetchFailedException.

On Mon, Dec 22, 2014 at 8:27 AM, steghe stefano.ghe...@icteam.it wrote:
Which version of Spark are you running? It could be related to https://issues.apache.org/jira/browse/SPARK-3633, fixed in 1.1.1 and 1.2.0.

--
Stefano Ghezzi
ICTeam S.p.A
Project Manager - PMP
tel 035 4232129    fax 035 4522034
email stefano.ghe...@icteam.it    url http://www.icteam.com
mobile 335 7308587
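For anyone who prefers to set these programmatically rather than on the spark-submit command line, here is a minimal sketch (assuming Spark 1.2 and the same property names Stefano used; the values are his workaround, not general recommendations, and the app name is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// Same properties as the spark-submit flags above, set on a SparkConf
// before the SparkContext is created.
val conf = new SparkConf()
  .setAppName("fetch-failure-workaround")                // hypothetical app name
  .set("spark.shuffle.compress", "false")                // disable shuffle block compression
  .set("spark.file.transferTo", "false")                 // avoid NIO transferTo on shuffle writes
  .set("spark.shuffle.manager", "hash")                  // fall back to the hash shuffle manager
  .set("spark.akka.frameSize", "50")                     // MB, for large task results/statuses
  .set("spark.core.connection.ack.wait.timeout", "600")  // seconds to wait for shuffle acks
val sc = new SparkContext(conf)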
Re: Fetch Failure
Which version of Spark are you running? It could be related to https://issues.apache.org/jira/browse/SPARK-3633, fixed in 1.1.1 and 1.2.0.
Fetch Failure
I have a job that runs fine on relatively small input datasets but reaches a threshold where I consistently get "Fetch failure" as the Failure Reason, late in the job, during a saveAsText() operation.

The first error we see on the Details for Stage page is ExecutorLostFailure. My Shuffle Read is 3.3 GB, and that's the only thing that seems high. We have three servers, they are configured on this job for 5g memory, and the job is running in spark-shell. The first error in the shell is "Lost executor 2 on (servername): remote Akka client disassociated".

We are still trying to understand how best to diagnose jobs using the web UI, so it's likely there is helpful info here that we just don't know how to interpret. Is there any kind of troubleshooting guide beyond the Spark Configuration page? I don't know if I'm providing enough info here. Thanks.
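Not from the original poster, but a minimal sketch of the two knobs most often tried first in this situation (the input paths, key extraction, and partition count below are hypothetical assumptions): give the executors more memory when launching spark-shell, and raise the number of shuffle partitions so each reduce task fetches a smaller slice of that 3.3 GB shuffle read.

// Launch the shell with more memory per executor, e.g.:
//   ./bin/spark-shell --executor-memory 5g
// Then split the shuffle into more, smaller partitions before the save:
val records = sc.textFile("hdfs:///path/to/input")             // hypothetical input
val keyed = records
  .map(line => (line.split(",")(0), line))                     // hypothetical keying
  .repartition(400)                                            // more partitions => smaller fetches per task
keyed.saveAsTextFile("hdfs:///path/to/output")                 // the RDD API call is saveAsTextFile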
Re: Fetch Failure
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Container [pid=24273,containerID=container_1418928607193_0011_01_02] is running beyond physical memory limits. Current usage: 6.5 GB of 6.5 GB physical memory used; 14.3 GB of 32.5 GB virtual memory used. Killing container.

Dump of the process-tree for container_1418928607193_0011_01_02:
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 24273 4304 24273 24273 (bash) 0 0 115630080 302 /bin/bash -c /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms6144m -Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_02/tmp org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler 1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4 1 /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_02/stdout 2 /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_02/stderr
|- 24277 24273 24273 24273 (java) 13808 1730 15204556800 1707660 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m -Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_02/tmp org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler 1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4

I've analyzed some heap dumps and see nothing out of the ordinary. Would love to know what could be causing this.

On Fri, Dec 19, 2014 at 7:46 AM, bethesda swearinge...@mac.com wrote:
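The kill above is YARN enforcing the container's physical memory limit (6.5 GB used of 6.5 GB allowed) even though the JVM heap is capped at 6144m, i.e. the overage is off-heap. A hedged sketch of the usual Spark-on-YARN 1.x adjustment, leaving explicit headroom between the heap and the container limit (the specific numbers here are illustrative, not tuned for this job):

import org.apache.spark.{SparkConf, SparkContext}

// Keep the heap a bit smaller and reserve explicit off-heap headroom so the
// container limit (heap + overhead) is not exceeded. Equivalent spark-submit flags:
//   --conf spark.executor.memory=5g --conf spark.yarn.executor.memoryOverhead=1024
val conf = new SparkConf()
  .set("spark.executor.memory", "5g")                    // JVM heap per executor
  .set("spark.yarn.executor.memoryOverhead", "1024")     // MB YARN reserves on top of the heap
val sc = new SparkContext(conf)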
many fetch failure in BlockManager
Hi all,

My job is CPU intensive, and its resource configuration is 400 workers * 1 core * 3G. There are many fetch failures, like:

14-08-23 08:34:52 WARN [Result resolver thread-3] TaskSetManager: Loss was due to fetch failure from BlockManagerId(slave1:33500)
14-08-23 08:34:52 INFO [spark-akka.actor.default-dispatcher-37] DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for resubmision due to a fetch failure
14-08-23 08:34:52 INFO [spark-akka.actor.default-dispatcher-37] DAGScheduler: The failed fetch was from Stage 5 (repartition at test.scala:82); marking it for resubmission
14-08-23 08:34:53 INFO [spark-akka.actor.default-dispatcher-71] DAGScheduler: Resubmitting failed stages
14-08-23 08:35:06 WARN [Result resolver thread-2] TaskSetManager: Loss was due to fetch failure from BlockManagerId(slave2:34792)
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63] DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for resubmision due to a fetch failure
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63] DAGScheduler: The failed fetch was from Stage 5 (repartition at test.scala:82); marking it for resubmission
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63] DAGScheduler: Executor lost: 118 (epoch 3)
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-38] BlockManagerMasterActor: Trying to remove executor 118 from BlockManagerMaster.
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63] BlockManagerMaster: Removed 118 successfully in removeExecutor
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-43] DAGScheduler: Resubmitting failed stages

Stage 4 is marked for resubmission. After a period of time, block manager slave1:33500 is registered again:

14-08-23 08:36:16 INFO [spark-akka.actor.default-dispatcher-58] BlockManagerInfo: Registering block manager slave1:33500 with 1766.4 MB RAM

Unfortunately, stage 4 is resubmitted again and again and meets many fetch failures. After 14-08-23 09:03:37 there is no log output on the master, and logging resumes at 14-08-24 00:43:15:

14-08-23 09:03:37 INFO [Result resolver thread-3] YarnClusterScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-28] DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for resubmision due to a fetch failure
14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-28] DAGScheduler: The failed fetch was from Stage 5 (repartition at test.scala:82); marking it for resubmission
14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-71] DAGScheduler: Resubmitting failed stages
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Completed container container_1400565786114_133451_01_000171 (state: COMPLETE, exit status: -100)
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Container marked as failed: container_1400565786114_133451_01_000171
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Completed container container_1400565786114_133451_01_000172 (state: COMPLETE, exit status: -100)
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Container marked as failed: container_1400565786114_133451_01_000172
14-08-24 00:43:20 INFO [Thread-854] ApplicationMaster: Allocating 2 containers to make up for (potentially) lost containers
14-08-24 00:43:20 INFO [Thread-854] YarnAllocationHandler: Will Allocate 2 executor containers, each with 3456 memory

Strangely, TaskSet 4.0 is removed because its tasks have all completed, while Stage 4 is marked for resubmission. In the executors there are many java.net.ConnectException: Connection timed out errors, like:

14-08-23 08:19:14 WARN [pool-3-thread-1] SendingConnection: Error finishing connection to
java.net.ConnectException: Connection timed out
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
        at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
        at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

I often meet such problems, i.e. BlockManager connection failures, from which Spark cannot recover effectively, and the job hangs or fails directly.

Any suggestions? And are there any guides about sizing resources for a job in terms of computing, cache, shuffle, etc.?

Thank you!
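Not a verified fix for this job, but a sketch of the settings people typically try on Spark 1.x when repeated fetch failures trace back to connection timeouts between block managers, along with trading the 400 * 1-core * 3G layout for fewer, larger executors (all values below are illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext}

// Relax the timeouts the "Connection timed out" stack trace is hitting, and
// consider fewer, larger executors (e.g. --num-executors 200 --executor-cores 2
// --executor-memory 6g on the spark-submit line) instead of 400 * 1 core * 3G.
val conf = new SparkConf()
  .set("spark.core.connection.ack.wait.timeout", "600")  // seconds to wait for shuffle acks
  .set("spark.akka.timeout", "300")                       // seconds for control-plane messages
val sc = new SparkContext(conf)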
Lost TID: Loss was due to fetch failure from BlockManagerId
I am running Spark 1.0 on a 4-node standalone Spark cluster (1 master + 3 workers). Our app is fetching data from Cassandra and doing a basic filter, map, and countByKey on that data.

I have run into a strange problem. Even if the number of rows in Cassandra is just 1M, the Spark job seems to go into an infinite loop and runs for hours. With a small amount of data (less than 100 rows), the job does finish, but takes almost 30-40 seconds, and we frequently see the messages shown below. If we run the same application on a single-node Spark (--master local[4]), then we don't see these warnings and the task finishes in less than 6-7 seconds.

Any idea what could be the cause of these problems when we run our application on a standalone 4-node Spark cluster?

14/06/30 19:30:16 WARN TaskSetManager: Lost TID 25036 (task 6.0:90)
14/06/30 19:30:16 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:18 WARN TaskSetManager: Lost TID 25310 (task 6.1:0)
14/06/30 19:30:18 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:19 WARN TaskSetManager: Lost TID 25582 (task 6.2:0)
14/06/30 19:30:19 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:21 WARN TaskSetManager: Lost TID 25882 (task 6.3:34)
14/06/30 19:30:21 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:22 WARN TaskSetManager: Lost TID 26152 (task 6.4:0)
14/06/30 19:30:22 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:23 WARN TaskSetManager: Lost TID 26427 (task 6.5:4)
14/06/30 19:30:23 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:25 WARN TaskSetManager: Lost TID 26690 (task 6.6:0)
14/06/30 19:30:25 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:26 WARN TaskSetManager: Lost TID 26959 (task 6.7:0)
14/06/30 19:30:26 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:28 WARN TaskSetManager: Lost TID 27449 (task 6.8:218)
14/06/30 19:30:28 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Lost TID 27718 (task 6.9:0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:31 WARN TaskSetManager: Lost TID 27991 (task 6.10:1)
14/06/30 19:30:31 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:33 WARN TaskSetManager: Lost TID 28265 (task 6.11:0)
14/06/30 19:30:33 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:34 WARN TaskSetManager: Lost TID 28550 (task 6.12:0)
14/06/30 19:30:34 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:36 WARN TaskSetManager: Lost TID 28822 (task 6.13:0)
14/06/30 19:30:36 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:37 WARN TaskSetManager: Lost TID 29093 (task 6.14:0)
14/06/30 19:30:37 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:39 WARN TaskSetManager: Lost TID 29366 (task 6.15:0)
14/06/30 19:30:39 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:40 WARN TaskSetManager: Lost TID 29648 (task 6.16:9)
14/06/30 19:30:40 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:42 WARN TaskSetManager: Lost TID 29924 (task 6.17:0)
14/06/30 19:30:42 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:43 WARN TaskSetManager: Lost TID 30193 (task 6.18:0)
14/06/30 19:30:43 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:45 WARN TaskSetManager: Lost TID 30559 (task 6.19:98)
14/06/30 19:30:45 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:46 WARN TaskSetManager: Lost TID 30826 (task 6.20:0)
14/06/30 19:30:46 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:48 WARN TaskSetManager: Lost TID 31098 (task 6.21:0)
14/06/30 19:30:48 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId
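For context, a hedged sketch of the shape of the job described (filter, map, countByKey); the Cassandra-reading code isn't shown in the thread, so the rows RDD and field handling below are hypothetical stand-ins. The point is that countByKey is the shuffle whose map output the lost TIDs are failing to fetch.

import org.apache.spark.rdd.RDD

// Hypothetical pipeline mirroring the description: filter -> map -> countByKey.
def countPerKey(rows: RDD[(String, String)]): scala.collection.Map[String, Long] =
  rows
    .filter { case (_, payload) => payload.nonEmpty }        // map-side only, no shuffle
    .map    { case (key, payload) => (key, payload.length) } // hypothetical transform
    .countByKey()                                            // the shuffle whose fetches are failing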
Re: Lost TID: Loss was due to fetch failure from BlockManagerId
A lot of things can get funny when you run distributed as opposed to local -- e.g. some jar not making it over. Do you see anything of interest in the logs on the executor machines -- I'm guessing 192.168.222.152/192.168.222.164? From https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala it seems the warning message is logged after the task fails -- but I wonder if you might see something more useful as to why it failed to begin with. As an example, we've had cases in HDFS where a small example would work but on a larger one we'd hit a bad file. The executor log is usually pretty explicit as to what happened...

On Tue, Jul 1, 2014 at 8:57 PM, Mohammed Guller moham...@glassbeam.com wrote:
Re: Lost TID: Loss was due to fetch failure from BlockManagerId
It could be because you are out of memory on the worker nodes and blocks are not getting registered. An older issue with 0.6.0 was dead nodes causing loss of tasks and then resubmission of data in an infinite loop; it was fixed in 0.7.0 though. Are you seeing a crash log in this log, or in the worker log @ 192.168.222.164, or on any of the machines where the crash log is displayed?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Wed, Jul 2, 2014 at 7:51 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: