Re: Fetch Failure

2014-12-23 Thread Stefano Ghezzi
I've eliminated the fetch failures with these parameters (I don't know which
one was the right one for the problem), passed to spark-submit running 1.2.0:

--conf spark.shuffle.compress=false \
--conf spark.file.transferTo=false \
--conf spark.shuffle.manager=hash \
--conf spark.akka.frameSize=50 \
--conf spark.core.connection.ack.wait.timeout=600
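
For reference, a minimal sketch of the same settings applied through SparkConf
instead of --conf flags (assuming Spark 1.2.x; the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Same settings as the --conf flags above, applied in code (Spark 1.2.x).
val conf = new SparkConf()
  .setAppName("fetch-failure-repro") // placeholder app name
  .set("spark.shuffle.compress", "false")
  .set("spark.file.transferTo", "false")
  .set("spark.shuffle.manager", "hash")
  .set("spark.akka.frameSize", "50")
  .set("spark.core.connection.ack.wait.timeout", "600")
val sc = new SparkContext(conf)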

...but like you, I'm still unable to finish a job; now I'm facing OOMs. Still
trying, but at least the fetch failures are gone.

bye

On 23/12/2014 21:10, Chen Song wrote:
I tried both 1.1.1 and 1.2.0 (built against cdh5.1.0 and hadoop2.3) 
but I am still seeing FetchFailedException.


--
Chen Song




--

Stefano Ghezzi ICTeam S.p.A
Project Manager - PMP
tel 035 4232129    fax 035 4522034
email   stefano.ghe...@icteam.it   url http://www.icteam.com
mobile  335 7308587




Re: Fetch Failure

2014-12-22 Thread steghe
Which version of spark are you running?

It could be related to this
https://issues.apache.org/jira/browse/SPARK-3633

fixed in 1.1.1 and 1.2.0
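
If you're not sure which version a running shell or application is on, a quick
check from the Scala REPL (sc is the SparkContext that spark-shell provides):

// Print the version of the running SparkContext.
println(sc.version) // SPARK-3633 is fixed in 1.1.1 and 1.2.0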








Fetch Failure

2014-12-19 Thread bethesda
I have a job that runs fine on relatively small input datasets but then
reaches a threshold where I begin to consistently get "Fetch failure" as the
Failure Reason, late in the job, during a saveAsTextFile() operation.

The first error we are seeing on the Details for Stage page is
ExecutorLostFailure.

My Shuffle Read is 3.3 GB, and that's the only thing that seems high. We have
three servers, they are configured with 5g of memory for this job, and the job
is running in spark-shell. The first error in the shell is "Lost executor 2
on (servername): remote Akka client disassociated".

We are still trying to understand how best to diagnose jobs using the web UI,
so it's likely that there's some helpful info here that we just don't know
how to interpret -- is there any kind of troubleshooting guide beyond the
Spark Configuration page? I don't know if I'm providing enough info here.

thanks.
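
For context, a hedged sketch of the rough shape of the job being described
(read, shuffle, save); sc is the spark-shell SparkContext, and the paths, key
extraction, and aggregation are placeholders since the real code isn't shown:

import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey (Spark 1.x)

val input  = sc.textFile("hdfs:///path/to/input")        // placeholder input path
val pairs  = input.map(line => (line.split(",")(0), 1L)) // placeholder key extraction
val result = pairs.reduceByKey(_ + _)                    // placeholder shuffle stage
result.saveAsTextFile("hdfs:///path/to/output")          // the step that fails late in the job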






Re: Fetch Failure

2014-12-19 Thread Jon Chase


I've analyzed some heap dumps and see nothing out of the ordinary.   Would
love to know what could be causing this.




Re: Fetch Failure

2014-12-19 Thread sandy . ryza
 


Re: Fetch Failure

2014-12-19 Thread Jon Chase


Re: Fetch Failure

2014-12-19 Thread Sandy Ryza


Re: Fetch Failure

2014-12-19 Thread Jon Chase
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
 (Container Monitor): Container
 [pid=24273,containerID=container_1418928607193_0011_01_02] is running
 beyond physical memory limits. Current usage: 6.5 GB of 6.5 GB physical
 memory used; 14.3 GB of 32.5 GB virtual memory used. Killing container.
 Dump of the process-tree for container_1418928607193_0011_01_02 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 24273 4304 24273 24273 (bash) 0 0 115630080 302 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms6144m -Xmx6144m  -verbose:gc -XX:+HeapDumpOnOutOfMemoryError
 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
 -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
 -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_02/tmp
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
 1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4 1
 /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_02/stdout
 2
 /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_02/stderr
 |- 24277 24273 24273 24273 (java) 13808 1730 15204556800 1707660
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m
 -Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails
 -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
 -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
 -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_02/tmp
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
 1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4


 I've analyzed some heap dumps and see nothing out of the ordinary.
 Would love to know what could be causing this.
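
One hedged note on the log above, not a confirmed diagnosis: the container was
killed by the YARN NodeManager for exceeding its physical memory cap (6.5 GB
used of 6.5 GB allowed), which has to cover the JVM heap plus off-heap and JVM
overhead beyond -Xmx. On Spark 1.1/1.2 on YARN, the per-executor overhead
allowance is spark.yarn.executor.memoryOverhead (in MB); a sketch of raising it
(the value is illustrative):

import org.apache.spark.SparkConf

// Typically passed to spark-submit as
//   --conf spark.yarn.executor.memoryOverhead=1024
// shown here as the equivalent SparkConf setting, made before the
// SparkContext is created.
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "1024") // MB; illustrative value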



Re: Fetch Failure

2014-12-19 Thread Jon Chase


many fetch failure in BlockManager

2014-08-25 Thread 余根茂
Hi all,


My job is CPU intensive, and its resource configuration is 400 workers * 1
core * 3 GB. There are many fetch failures, like:



14-08-23 08:34:52 WARN [Result resolver thread-3] TaskSetManager: Loss
was due to fetch failure from BlockManagerId(slave1:33500)

14-08-23 08:34:52 INFO [spark-akka.actor.default-dispatcher-37]
DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for
resubmision due to a fetch failure

14-08-23 08:34:52 INFO [spark-akka.actor.default-dispatcher-37]
DAGScheduler: The failed fetch was from Stage 5 (repartition at
test.scala:82); marking it for resubmission

14-08-23 08:34:53 INFO [spark-akka.actor.default-dispatcher-71]
DAGScheduler: Resubmitting failed stages

14-08-23 08:35:06 WARN [Result resolver thread-2] TaskSetManager: Loss
was due to fetch failure from BlockManagerId(slave2:34792)

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for
resubmision due to a fetch failure

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
DAGScheduler: The failed fetch was from Stage 5 (repartition at
test.scala:82); marking it for resubmission

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
DAGScheduler: Executor lost: 118 (epoch 3)

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-38]
BlockManagerMasterActor: Trying to remove executor 118 from
BlockManagerMaster.

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
BlockManagerMaster: Removed 118 successfully in removeExecutor

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-43]
DAGScheduler: Resubmitting failed stages

Stage 4 is marked for resubmission. After a period of time, block manager
slave1:33500 is registered again:

14-08-23 08:36:16 INFO [spark-akka.actor.default-dispatcher-58]
BlockManagerInfo: Registering block manager slave1:33500 with 1766.4
MB RAM

Unfortunately, stage 4 is resubmitted again and again and hits many fetch
failures. After 14-08-23 09:03:37 there is no log output on the master until
14-08-24 00:43:15:

14-08-23 09:03:37 INFO [Result resolver thread-3]
YarnClusterScheduler: Removed TaskSet 4.0, whose tasks have all
completed, from pool

14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-28]
DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for
resubmision due to a fetch failure

14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-28]
DAGScheduler: The failed fetch was from Stage 5 (repartition at
test.scala:82); marking it for resubmission

14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-71]
DAGScheduler: Resubmitting failed stages

14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Completed
container container_1400565786114_133451_01_000171 (state: COMPLETE,
exit status: -100)

14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Container
marked as failed: container_1400565786114_133451_01_000171

14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Completed
container container_1400565786114_133451_01_000172 (state: COMPLETE,
exit status: -100)

14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Container
marked as failed: container_1400565786114_133451_01_000172

14-08-24 00:43:20 INFO [Thread-854] ApplicationMaster: Allocating 2
containers to make up for (potentially) lost containers

14-08-24 00:43:20 INFO [Thread-854] YarnAllocationHandler: Will
Allocate 2 executor containers, each with 3456 memory

Strangely, TaskSet 4.0 is removed because its tasks have all completed, even
though Stage 4 was marked for resubmission. In the executors there are many
java.net.ConnectException: Connection timed out errors, like:


14-08-23 08:19:14 WARN [pool-3-thread-1] SendingConnection: Error
finishing connection to java.net.ConnectException: Connection timed
out

 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

 at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)

 at 
org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)

 at 
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

 at java.lang.Thread.run(Thread.java:662)


I often run into such problems, i.e. BlockManager connection failures, from
which Spark cannot recover effectively, and the job hangs or fails outright.


Any suggestions? And are there any guides on sizing resources for a job in
terms of compute, cache, shuffle, etc.?


Thank you!
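
Not an authoritative fix, but for the ConnectionManager "Connection timed out"
errors above, the timeout-related settings mentioned earlier in this thread are
a common starting point; a sketch with illustrative values:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.core.connection.ack.wait.timeout", "600") // seconds to wait for block acks
  .set("spark.akka.frameSize", "50")                    // MB, max Akka message size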


Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Mohammed Guller
I am running Spark 1.0 on a 4-node standalone Spark cluster (1 master + 3
workers). Our app is fetching data from Cassandra and doing a basic filter, map,
and countByKey on that data. I have run into a strange problem. Even if the
number of rows in Cassandra is just 1M, the Spark job seems to go into an
infinite loop and runs for hours. With a small amount of data (less than 100
rows), the job does finish, but takes almost 30-40 seconds and we frequently
see the messages shown below. If we run the same application on a single-node
Spark (--master local[4]), then we don't see these warnings and the task
finishes in less than 6-7 seconds. Any idea what could be the cause of these
problems when we run our application on a standalone 4-node Spark cluster?

14/06/30 19:30:16 WARN TaskSetManager: Lost TID 25036 (task 6.0:90)
14/06/30 19:30:16 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:18 WARN TaskSetManager: Lost TID 25310 (task 6.1:0)
14/06/30 19:30:18 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:19 WARN TaskSetManager: Lost TID 25582 (task 6.2:0)
14/06/30 19:30:19 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:21 WARN TaskSetManager: Lost TID 25882 (task 6.3:34)
14/06/30 19:30:21 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:22 WARN TaskSetManager: Lost TID 26152 (task 6.4:0)
14/06/30 19:30:22 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:23 WARN TaskSetManager: Lost TID 26427 (task 6.5:4)
14/06/30 19:30:23 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:25 WARN TaskSetManager: Lost TID 26690 (task 6.6:0)
14/06/30 19:30:25 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:26 WARN TaskSetManager: Lost TID 26959 (task 6.7:0)
14/06/30 19:30:26 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:28 WARN TaskSetManager: Lost TID 27449 (task 6.8:218)
14/06/30 19:30:28 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Lost TID 27718 (task 6.9:0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:31 WARN TaskSetManager: Lost TID 27991 (task 6.10:1)
14/06/30 19:30:31 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:33 WARN TaskSetManager: Lost TID 28265 (task 6.11:0)
14/06/30 19:30:33 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:34 WARN TaskSetManager: Lost TID 28550 (task 6.12:0)
14/06/30 19:30:34 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:36 WARN TaskSetManager: Lost TID 28822 (task 6.13:0)
14/06/30 19:30:36 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:37 WARN TaskSetManager: Lost TID 29093 (task 6.14:0)
14/06/30 19:30:37 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:39 WARN TaskSetManager: Lost TID 29366 (task 6.15:0)
14/06/30 19:30:39 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:40 WARN TaskSetManager: Lost TID 29648 (task 6.16:9)
14/06/30 19:30:40 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:42 WARN TaskSetManager: Lost TID 29924 (task 6.17:0)
14/06/30 19:30:42 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:43 WARN TaskSetManager: Lost TID 30193 (task 6.18:0)
14/06/30 19:30:43 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:45 WARN TaskSetManager: Lost TID 30559 (task 6.19:98)
14/06/30 19:30:45 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:46 WARN TaskSetManager: Lost TID 30826 (task 6.20:0)
14/06/30 19:30:46 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:48 WARN TaskSetManager: Lost TID 31098 (task 6.21:0)
14/06/30 19:30:48 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId
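
For reference, a hedged sketch of the pipeline described above (filter, map,
countByKey); the Cassandra read is replaced with a placeholder RDD because the
connector call, table, and predicates aren't given in the thread:

import org.apache.spark.SparkContext._ // pair-RDD functions such as countByKey (Spark 1.0)

// Stand-in for the rows fetched from Cassandra; filter and map are placeholders.
val rows = sc.parallelize(Seq(("key-a", 1), ("key-b", 2), ("key-a", 3)))
val counts = rows
  .filter { case (_, value) => value > 0 }       // placeholder filter
  .map { case (key, value) => (key, value * 2) } // placeholder map
  .countByKey()                                  // the countByKey from the description above
println(counts)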

Re: Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Yana Kadiyska
A lot of things can get funny when you run distributed as opposed to
local -- e.g. some jar not making it over. Do you see anything of
interest in the logs on the executor machines -- I'm guessing
192.168.222.152/192.168.222.164? From here
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
it seems like the warning message is logged after the task fails -- but I
wonder if you might see something more useful as to why it failed to
begin with. As an example, we've had cases in HDFS where a small
example would work but on a larger example we'd hit a bad file. But
the executor log is usually pretty explicit as to what happened...


Re: Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Mayur Rustagi
It could be because you are out of memory on the worker nodes and blocks are
not getting registered.
An older issue with 0.6.0 was dead nodes causing loss of tasks and then
resubmission of data in an infinite loop; it was fixed in 0.7.0 though.
Are you seeing a crash log in this log, or in the worker log @ 192.168.222.164,
or on any of the machines where the crash log is displayed?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi


