Running multiple Giraph jobs on the same cluster can lead to port collisions
----------------------------------------------------------------------------

                 Key: GIRAPH-72
                 URL: https://issues.apache.org/jira/browse/GIRAPH-72
             Project: Giraph
          Issue Type: Bug
          Components: lib, zookeeper
    Affects Versions: 0.70.0
         Environment: production hadoop cluster, in-process ZK.
            Reporter: Jake Mannix


We had a Giraph mini-hackathon at work today, and many of us launched test
jobs simultaneously. We often ran into the following collision:

------
startSuperstep: WORKER_ONLY - Attempt=0, Superstep=-1
2-Nov-2011 23:40:08

java.net.BindException: Problem binding to <hostname>/<hostIP>:30000 : Address already in use
        at org.apache.hadoop.ipc.Server.bind(Server.java:196)
        at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:259)
        at org.apache.hadoop.ipc.Server.<init>(Server.java:1039)
        at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:492)
        at org.apache.hadoop.ipc.RPC.getServer(RPC.java:454)
        at org.apache.giraph.comm.RPCCommunications.getRPCServer(RPCCommunications.java:99)
        at org.apache.giraph.comm.BasicRPCCommunications.<init>(BasicRPCCommunications.java:362)
        at org.apache.giraph.comm.RPCCommunications.<init>(RPCCommunications.java:71)
        at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:570)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.net.BindException: Address already in use
        at sun.nio.ch.Net.bind(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
        at org.apache.hadoop.ipc.Server.bind(Server.java:194)
        ... 12 more
----

The job then simply hung.  At a bare minimum, I'd imagine it should catch
this exception and let the task die quickly so that it can be retried on
another machine.  Better yet, a command-line argument at startup (passed into
the Configuration) could decide which ports to use.  Best of all would be
something automagic that lets multiple GraphMappers run on the same machine
without manually picking ports (pick one at random and store it in
ZooKeeper?  But then what about the in-process ZooKeeper...)
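
To make the "automagic" option a bit more concrete, here is a rough sketch of
one way it might look, using only java.net and no Hadoop or Giraph classes.
The class PortPicker and the method bindFirstFree are hypothetical names made
up for this illustration; they are not existing Giraph or Hadoop APIs.  The
sketch probes ports upward from a base port (30000 in the trace above) and
fails fast with the last BindException if none is free, so a task would die
and be retried rather than hang.  How the chosen port would be published to
other workers (for example via ZooKeeper) is left open.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical helper for illustration only; not part of Giraph or Hadoop. */
final class PortPicker {
    private PortPicker() {}

    /**
     * Tries basePort, basePort + 1, ... (maxAttempts ports in total) and
     * returns a socket bound to the first port that is free. If every port
     * in the range is taken, the last BindException is rethrown so the
     * caller can fail quickly instead of hanging.
     */
    static ServerSocket bindFirstFree(int basePort, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1 || basePort < 1
                || basePort + maxAttempts - 1 > 65535) {
            throw new IllegalArgumentException("invalid port range");
        }
        BindException lastFailure = null;
        for (int i = 0; i < maxAttempts; i++) {
            ServerSocket socket = new ServerSocket(); // created unbound
            try {
                socket.bind(new InetSocketAddress(basePort + i));
                return socket;
            } catch (BindException e) {
                // Port already in use: release the socket, try the next one.
                socket.close();
                lastFailure = e;
            } catch (IOException e) {
                // Any other I/O failure is not a collision; do not retry.
                socket.close();
                throw e;
            }
        }
        throw lastFailure;
    }
}
```

A caller would read the port actually bound with getLocalPort() on the
returned socket.  The base port and the number of attempts could come from the
Configuration, which would also cover the command-line option mentioned above.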


        
