The reason is actually simple. If you run more than one Giraph worker per machine, there will be a port conflict. Worse yet, imagine multiple Giraph jobs running simultaneously on a cluster. Hence the incrementing-port strategy. It would be straightforward to add a configurable option to use a single port for situations such as yours, though (especially since you know where the code is now).
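
A rough sketch of what such an option might look like, purely as an illustration: the class below is standalone and not Giraph code, and the `singlePort` flag is a hypothetical stand-in for a configuration option that does not exist in the code quoted further down this thread. The base port of 30000 is the one mentioned in this thread.

```java
// Hypothetical sketch, not Giraph code. The singlePort flag stands in for a
// configuration option that is NOT part of the NettyServer excerpt quoted below.
final class BindPortPolicy {
    private final int basePort;
    private final boolean singlePort;

    BindPortPolicy(int basePort, boolean singlePort) {
        this.basePort = basePort;
        this.singlePort = singlePort;
    }

    /** Port a task would first try to bind. */
    int initialPort(int taskId) {
        // Per-task mode reproduces "base port + taskId";
        // single-port mode ignores the task ID entirely.
        return singlePort ? basePort : basePort + taskId;
    }

    public static void main(String[] args) {
        BindPortPolicy perTask = new BindPortPolicy(30000, false);
        BindPortPolicy shared = new BindPortPolicy(30000, true);
        System.out.println(perTask.initialPort(89)); // 30089
        System.out.println(shared.initialPort(89));  // 30000
    }
}
```

The trade-off is the one described above: with a single shared port, two workers landing on the same machine (or two jobs on the same cluster) would collide, so a flag like this only makes sense when one worker per host is guaranteed.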

Avery

On 11/22/13 11:19 AM, Larry Compton wrote:
Avery,

It looks like the ports are being allocated the way we suspected (30000 + task ID). That's a problem for us because we'll have to open a wide bank of ports (the SAs want to minimize open ports) and also keep them available for use by Giraph. Ideally, the port allocation would take the host into consideration. If you ask for 200 workers and they're each running on a different host, port 30000 could be used by every Netty server. The way it's working now, a different port is being allocated per worker, which appears unnecessary. Is there a reason a different port is used per worker/task?

Is this still the way ports are allocated in Giraph 1.1.0?

Larry


On Fri, Nov 22, 2013 at 1:18 PM, Avery Ching wrote:

    The port logic is a bit complex, but all encapsulated in
    NettyServer.java (see below).

    If nothing else is running on those ports and you really only have
    one Giraph worker per port, you should be good to go.  Can you look
    at the logs for the worker that is trying to bind to a port other
    than base port + taskId?


        int taskId = conf.getTaskPartition();
        int numTasks = conf.getInt("mapred.map.tasks", 1);
        // Number of workers + 1 for master
        int numServers =
            conf.getInt(GiraphConstants.MAX_WORKERS, numTasks) + 1;
        int portIncrementConstant =
            (int) Math.pow(10, Math.ceil(Math.log10(numServers)));
        int bindPort =
            GiraphConstants.IPC_INITIAL_PORT.get(conf) + taskId;
        int bindAttempts = 0;
        final int maxIpcPortBindAttempts =
            MAX_IPC_PORT_BIND_ATTEMPTS.get(conf);
        final boolean failFirstPortBindingAttempt =
            GiraphConstants.FAIL_FIRST_IPC_PORT_BIND_ATTEMPT.get(conf);

        // Simple handling of port collisions on the same machine while
        // preserving debugability from the port number alone.
        // Round up the max number of workers to the next power of 10 and
        // use it as a constant to increase the port number with.
        while (bindAttempts < maxIpcPortBindAttempts) {
          this.myAddress = new InetSocketAddress(localHostname, bindPort);
          if (failFirstPortBindingAttempt && bindAttempts == 0) {
            if (LOG.isInfoEnabled()) {
              LOG.info("start: Intentionally fail first " +
                  "binding attempt as giraph.failFirstIpcPortBindAttempt " +
                  "is true, port " + bindPort);
            }
            ++bindAttempts;
            bindPort += portIncrementConstant;
            continue;
          }

          try {
            Channel ch = bootstrap.bind(myAddress);
            accepted.add(ch);

            break;
          } catch (ChannelException e) {
            LOG.warn("start: Likely failed to bind on attempt " +
                bindAttempts + " to port " + bindPort, e);
            ++bindAttempts;
            bindPort += portIncrementConstant;
          }
        }
        if (bindAttempts == maxIpcPortBindAttempts || myAddress == null) {
          throw new IllegalStateException(
              "start: Failed to start NettyServer with " +
                  bindAttempts + " attempts");
        }
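
To make the arithmetic in the excerpt above concrete, here is a small standalone sketch (not Giraph code; the class and method names are made up for illustration). It reuses the same increment formula and assumes the base port of 30000 mentioned in this thread, with 90 workers plus one master.

```java
// Illustrative only: reproduces the port arithmetic from the excerpt above
// outside of Giraph. Class and method names are hypothetical.
final class PortScheme {
    // Same formula as portIncrementConstant in the excerpt.
    static int portIncrement(int numServers) {
        return (int) Math.pow(10, Math.ceil(Math.log10(numServers)));
    }

    // Port tried on a given bind attempt (0-based): base + taskId,
    // then bumped by the increment once per failed attempt.
    static int candidatePort(int basePort, int taskId,
                             int numServers, int attempt) {
        return basePort + taskId + attempt * portIncrement(numServers);
    }

    public static void main(String[] args) {
        int basePort = 30000;  // assumed, from the range quoted in this thread
        int numServers = 91;   // 90 workers + 1 master

        System.out.println(portIncrement(numServers));                  // 100
        System.out.println(candidatePort(basePort, 89, numServers, 0)); // 30089
        System.out.println(candidatePort(basePort, 89, numServers, 1)); // 30189
        System.out.println(portIncrement(101));                         // 1000
    }
}
```

With 91 servers the increment is 100, so the first attempt stays within 30000-30100, but a single failed bind (or giraph.failFirstIpcPortBindAttempt set to true) moves the next attempt past 30100. Once the server count exceeds 100, the increment jumps to 1000.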



    On 11/22/13 9:15 AM, Larry Compton wrote:

        My teammates and I are running Giraph on a cluster where a
        firewall is configured on each compute node. We had 100 ports
        opened on the compute nodes, which we thought would be more
        than enough to accommodate a large number of workers. However,
        we're unable to go beyond about 90 workers with our Giraph
        jobs, due to Netty ports being allocated outside of the range
        (30000-30100). We're not sure why this is happening. We
        shouldn't be running more than one worker per compute node, so
        we were assuming that only port 30000 would be used, but we're
        routinely seeing Giraph try to use ports greater than 30100
        when we request close to 100 workers. This leads us to believe
        that a simple one-up numbering scheme is being used that
        doesn't take the host into consideration, although this is
        only speculation.

        Is there a way around this problem? Our system admins
        understandably balked at opening 1000 ports.

        Larry




