Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-30 Thread Adrian Bridgett
I've worked around this by setting 
spark.shuffle.io.connectionTimeout=3600s, uploading the spark tarball to 
HDFS again and restarting the shuffle service (not 100% sure that last 
step is needed).
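
For reference, the workaround is just a config fragment like this in spark-defaults.conf (3600s is simply a large-enough value, not a tuned one):

```
# spark-defaults.conf (picked up when the tarball is rebuilt/uploaded)
# Raise the network idle timeout so the shuffle service doesn't drop
# the driver's registration connection after the 120s default.
spark.shuffle.io.connectionTimeout  3600s
```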


I attempted (with my newbie Scala skills) to allow 
ExternalShuffleClient() to accept an optional closeIdleConnections 
parameter (defaulting to "true") so that MesosExternalShuffleClient 
can set it to "false".  I then passed this into the TransportContext 
call.  However, this didn't seem to work (maybe it's using the config 
from HDFS rather than the local Spark one, which I thought the driver used).
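
For anyone curious, the shape of the change I tried was roughly this (pseudocode only - the class and constructor names follow the 1.6 network-shuffle sources from memory and may not match exactly):

```scala
// Sketch of the attempted patch, not a verified diff.
class ExternalShuffleClient(
    conf: TransportConf,
    secretKeyHolder: SecretKeyHolder,
    saslEnabled: Boolean,
    closeIdleConnections: Boolean = true) {  // new optional parameter

  def init(appId: String): Unit = {
    // Pass the flag through so the transport layer doesn't reap the
    // idle driver <-> shuffle-service registration connection.
    context = new TransportContext(conf, rpcHandler, closeIdleConnections)
    ...
  }
}

// MesosExternalShuffleClient then asks for its connection to be kept open:
class MesosExternalShuffleClient(...)
  extends ExternalShuffleClient(..., closeIdleConnections = false)
```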


Anyhow I'll do more testing and then raise a JIRA.

Adrian
--
*Adrian Bridgett* |  Sysadmin Engineer, OpenSignal 


Office: First Floor, Scriptor Court, 155-157 Farringdon Road, 
Clerkenwell, London, EC1R 3AD
Phone: +44 777-377-8251
Skype: abridgett | Twitter: @adrianbridgett


Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-30 Thread Adrian Bridgett

Hi Ted,

sorry, I should have been a bit more consistent in my cut and paste 
(there are nine nodes plus the driver) - I was concentrating on S9/6 (these 
logs are from that box, 10.1.201.165). The S1/4 lines are:


15/12/29 18:49:45 INFO CoarseMesosSchedulerBackend: Registered executor 
NettyRpcEndpointRef(null) (ip-10-1-202-114.ec2.internal:19891) with ID 
f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4
15/12/29 18:49:45 INFO ExecutorAllocationManager: New executor 
f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4 has registered (new total is 6)
15/12/29 18:49:45 INFO BlockManagerMasterEndpoint: Registering block 
manager ip-10-1-202-114.ec2.internal:14257 with 13.8 GB RAM, 
BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4, 
ip-10-1-202-114.ec2.internal, 14257)
15/12/29 18:58:07 WARN TaskSetManager: Lost task 21.0 in stage 1.0 (TID 
2149, ip-10-1-200-232.ec2.internal): 
FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4, 
ip-10-1-202-114.ec2.internal, 7337), shuffleId=1, mapId=5, reduceId=21, 
message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.RuntimeException: Executor is not registered 
(appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014, 
execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4)
at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)


I've tried to see how I can increase the idle timeout of 
mesosExternalShuffleClient.registerDriverWithShuffleService, as that 
seems to be the core issue.



On 29/12/2015 21:17, Ted Yu wrote:

Have you searched log for 'f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4' ?

In the snippet you posted, I don't see registration of this Executor.

Cheers

On Tue, Dec 29, 2015 at 12:43 PM, Adrian Bridgett wrote:


We're seeing an "Executor is not registered" error on a Spark
(1.6.0rc4, mesos-0.26) cluster.  It seems as if the logic in
MesosExternalShuffleService.scala isn't working for some reason
(new in 1.6 I believe).

spark application sees this:
...
15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully
registered app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with
external shuffle service.
15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully
registered app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with
external shuffle service.
15/12/29 18:49:43 INFO CoarseMesosSchedulerBackend: Registered
executor NettyRpcEndpointRef(null)
(ip-10-1-201-165.ec2.internal:37660) with ID
f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6
15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor
f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1 has registered (new
total is 1)
15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering
block manager ip-10-1-201-165.ec2.internal:53854 with 13.8 GB RAM,
BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6,
ip-10-1-201-165.ec2.internal, 53854)
15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering
block manager ip-10-1-201-132.ec2.internal:12793 with 13.8 GB RAM,
BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1,
ip-10-1-201-132.ec2.internal, 12793)
15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor
f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6 has registered (new
total is 2)
...
15/12/29 18:58:06 INFO BlockManagerInfo: Added broadcast_6_piece0
in memory on ip-10-1-201-165.ec2.internal:53854 (size: 5.2KB,
free: 13.8GB)
15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to
send map output locations for shuffle 1 to
ip-10-1-202-121.ec2.internal:59734
15/12/29 18:58:06 INFO MapOutputTrackerMaster: Size of output
statuses for shuffle 1 is 1671814 bytes
15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to
send map output locations for shuffle 1 to
ip-10-1-201-165.ec2.internal:37660
...
15/12/29 18:58:07 INFO TaskSetManager: Starting task 63.0 in stage
1.0 (TID 2191, ip-10-1-200-232.ec2.internal, partition
63,PROCESS_LOCAL, 2171 bytes)
15/12/29 18:58:07 WARN TaskSetManager: Lost task 21.0 in stage 1.0
(TID 2149, ip-10-1-200-232.ec2.internal):
FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4,
ip-10-1-202-114.ec2.internal, 7337), shuffleId=1, mapId=5,
reduceId=21, message=
org.apache.spark.shuffle.FetchFailedException:
java.lang.RuntimeException: Executor is not registered
(appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4)
at

org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
...
15/12/29 18:58:07 INFO DAGScheduler: Resubmitting ShuffleMapStage
0 (reduceByKey at
/home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:48)
and ShuffleMapStage 1 (reduceByKey at

Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-30 Thread Adrian Bridgett
To wrap this up: it's the shuffle manager sending the FIN, so setting 
spark.shuffle.io.connectionTimeout to 3600s is the only workaround right 
now.  SPARK-12583 raised.


Adrian

--
*Adrian Bridgett*


Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-29 Thread Ted Yu
Have you searched log for 'f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4' ?

In the snippet you posted, I don't see registration of this Executor.

Cheers

On Tue, Dec 29, 2015 at 12:43 PM, Adrian Bridgett 
wrote:

> We're seeing an "Executor is not registered" error on a Spark (1.6.0rc4,
> mesos-0.26) cluster.  It seems as if the logic in
> MesosExternalShuffleService.scala isn't working for some reason (new in 1.6
> I believe).
>
> spark application sees this:
> ...
> 15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully registered
> app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with external shuffle service.
> 15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully registered
> app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with external shuffle service.
> 15/12/29 18:49:43 INFO CoarseMesosSchedulerBackend: Registered executor
> NettyRpcEndpointRef(null) (ip-10-1-201-165.ec2.internal:37660) with ID
> f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6
> 15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor
> f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1 has registered (new total is 1)
> 15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering block
> manager ip-10-1-201-165.ec2.internal:53854 with 13.8 GB RAM,
> BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6,
> ip-10-1-201-165.ec2.internal, 53854)
> 15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering block
> manager ip-10-1-201-132.ec2.internal:12793 with 13.8 GB RAM,
> BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1,
> ip-10-1-201-132.ec2.internal, 12793)
> 15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor
> f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6 has registered (new total is 2)
> ...
> 15/12/29 18:58:06 INFO BlockManagerInfo: Added broadcast_6_piece0 in
> memory on ip-10-1-201-165.ec2.internal:53854 (size: 5.2KB, free: 13.8GB)
> 15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> output locations for shuffle 1 to ip-10-1-202-121.ec2.internal:59734
> 15/12/29 18:58:06 INFO MapOutputTrackerMaster: Size of output statuses for
> shuffle 1 is 1671814 bytes
> 15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> output locations for shuffle 1 to ip-10-1-201-165.ec2.internal:37660
> ...
> 15/12/29 18:58:07 INFO TaskSetManager: Starting task 63.0 in stage 1.0
> (TID 2191, ip-10-1-200-232.ec2.internal, partition 63,PROCESS_LOCAL, 2171
> bytes)
> 15/12/29 18:58:07 WARN TaskSetManager: Lost task 21.0 in stage 1.0 (TID
> 2149, ip-10-1-200-232.ec2.internal):
> FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4,
> ip-10-1-202-114.ec2.internal, 7337), shuffleId=1, mapId=5, reduceId=21,
> message=
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException:
> Executor is not registered
> (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
> execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4)
> at
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
> ...
> 15/12/29 18:58:07 INFO DAGScheduler: Resubmitting ShuffleMapStage 0
> (reduceByKey at
> /home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:48) and
> ShuffleMapStage 1 (reduceByKey at
> /home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:50) due to
> fetch failure
> 15/12/29 18:58:07 WARN TaskSetManager: Lost task 12.0 in stage 1.0 (TID
> 2140, ip-10-1-200-232.ec2.internal):
> FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6,
> ip-10-1-201-165.ec2.internal, 7337), shuffleId=1, mapId=6, reduceId=12,
> message=
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException:
> Executor is not registered
> (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
> execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6)
> at
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
>
>
> shuffle service itself (on the executor's IP) sees:
> 15/12/29 18:49:41 INFO MesosExternalShuffleBlockHandler: Received
> registration request from app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014
> (remote address /10.1.200.165:37889).  (that's the driver IP)
> 15/12/29 18:49:43 WARN MesosExternalShuffleBlockHandler: Unknown /
> 10.1.201.165:52562 disconnected. (executor IP)
> 15/12/29 18:51:41 INFO MesosExternalShuffleBlockHandler: Application
> a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 disconnected (address was /
> 10.1.200.165:37889). (driver IP again)
> 15/12/29 18:58:07 ERROR TransportRequestHandler: Error while invoking
> RpcHandler#receive() on RPC id 6244044000322436908
> java.lang.RuntimeException: Executor is not registered
> (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
> execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6) (executor IP)
> at
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
>
> At first I wondered if reducing 

Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-29 Thread Adrian Bridgett
We're seeing an "Executor is not registered" error on a Spark (1.6.0rc4, 
mesos-0.26) cluster.  It seems as if the logic in 
MesosExternalShuffleService.scala isn't working for some reason (new in 
1.6 I believe).


spark application sees this:
...
15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully 
registered app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with external 
shuffle service.
15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully 
registered app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with external 
shuffle service.
15/12/29 18:49:43 INFO CoarseMesosSchedulerBackend: Registered executor 
NettyRpcEndpointRef(null) (ip-10-1-201-165.ec2.internal:37660) with ID 
f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6
15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor 
f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1 has registered (new total is 1)
15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering block 
manager ip-10-1-201-165.ec2.internal:53854 with 13.8 GB RAM, 
BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6, 
ip-10-1-201-165.ec2.internal, 53854)
15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering block 
manager ip-10-1-201-132.ec2.internal:12793 with 13.8 GB RAM, 
BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1, 
ip-10-1-201-132.ec2.internal, 12793)
15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor 
f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6 has registered (new total is 2)

...
15/12/29 18:58:06 INFO BlockManagerInfo: Added broadcast_6_piece0 in 
memory on ip-10-1-201-165.ec2.internal:53854 (size: 5.2KB, free: 13.8GB)
15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
output locations for shuffle 1 to ip-10-1-202-121.ec2.internal:59734
15/12/29 18:58:06 INFO MapOutputTrackerMaster: Size of output statuses 
for shuffle 1 is 1671814 bytes
15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
output locations for shuffle 1 to ip-10-1-201-165.ec2.internal:37660

...
15/12/29 18:58:07 INFO TaskSetManager: Starting task 63.0 in stage 1.0 
(TID 2191, ip-10-1-200-232.ec2.internal, partition 63,PROCESS_LOCAL, 
2171 bytes)
15/12/29 18:58:07 WARN TaskSetManager: Lost task 21.0 in stage 1.0 (TID 
2149, ip-10-1-200-232.ec2.internal): 
FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4, 
ip-10-1-202-114.ec2.internal, 7337), shuffleId=1, mapId=5, reduceId=21, 
message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.RuntimeException: Executor is not registered 
(appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014, 
execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4)
at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)

...
15/12/29 18:58:07 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 
(reduceByKey at 
/home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:48) and 
ShuffleMapStage 1 (reduceByKey at 
/home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:50) due to 
fetch failure
15/12/29 18:58:07 WARN TaskSetManager: Lost task 12.0 in stage 1.0 (TID 
2140, ip-10-1-200-232.ec2.internal): 
FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6, 
ip-10-1-201-165.ec2.internal, 7337), shuffleId=1, mapId=6, reduceId=12, 
message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.RuntimeException: Executor is not registered 
(appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014, 
execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6)
at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)



shuffle service itself (on the executor's IP) sees:
15/12/29 18:49:41 INFO MesosExternalShuffleBlockHandler: Received 
registration request from app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 
(remote address /10.1.200.165:37889).  (that's the driver IP)
15/12/29 18:49:43 WARN MesosExternalShuffleBlockHandler: Unknown 
/10.1.201.165:52562 disconnected. (executor IP)
15/12/29 18:51:41 INFO MesosExternalShuffleBlockHandler: Application 
a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 disconnected (address was 
/10.1.200.165:37889). (driver IP again)
15/12/29 18:58:07 ERROR TransportRequestHandler: Error while invoking 
RpcHandler#receive() on RPC id 6244044000322436908
java.lang.RuntimeException: Executor is not registered 
(appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014, 
execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6) (executor IP)
at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)


At first I wondered if reducing spark.shuffle.io.numConnectionsPerPeer 
back down to 1 (from 4) would help - maybe it wasn't keeping track of 
the number of connections.  However, now with the extra debug I wonder 
if the driver is disconnecting after 2mins and the mesos shuffle service 
takes this as a sign that the whole app has finished and tidies up?


I couldn't increase the time