I see. So poll() should be used instead of select() for srun to scale.
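
For reference, a readability check built on poll() might look roughly like the sketch below. This is only an illustration, not the actual code in src/common/fd.c: the helper name fd_wait_readable() and its return convention are made up here. Since poll() takes an array of struct pollfd rather than a fixed-size bit set, the numeric value of the descriptor is not limited by FD_SETSIZE.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper (name and return convention are assumptions).
 * Returns 1 if fd is readable within timeout_ms, 0 on timeout,
 * -1 on error with errno set. */
static int fd_wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if interrupted by a signal. Note that this simple loop
     * restarts the full timeout on every retry. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0)
        return rc;          /* 0: timed out, -1: poll() failed */

    if (pfd.revents & POLLNVAL) {
        errno = EBADF;      /* fd is not an open descriptor */
        return -1;
    }

    /* On POLLHUP or POLLERR, report "readable" so that the caller's
     * next read() returns EOF or the pending error. */
    return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) ? 1 : 0;
}
```

A caller would use it as `if (fd_wait_readable(fd, 5000) == 1) { /* read() */ }`, with a descriptor number of 1024 or above being handled the same way as any other.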
On Tue, 2013-03-12 at 06:44 -0600, David Bigagli wrote:
> This is the way select() works regardless of the version of Red Hat or
> any other distribution.
>
> An fd_set is a bit array, declared in <sys/select.h>, that holds
> __FD_SETSIZE bits; __FD_SETSIZE is defined as 1024 in <bits/typesizes.h>.
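
Where select() has to stay, a guard along the following lines could at least turn the silent failure into an explicit one. This is a minimal sketch; fd_fits_in_fd_set() is a hypothetical name, not an existing function. The check matters because calling FD_SET() or FD_ISSET() with a descriptor outside 0..FD_SETSIZE-1 is undefined behavior: the macros index past the end of the bit array.

```c
#include <sys/select.h>

/* Hypothetical helper: returns 1 if fd can safely be used with
 * FD_SET()/FD_ISSET()/select(), 0 otherwise. */
static int fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}
```

A caller would test `fd_fits_in_fd_set(fd)` before FD_SET(fd, &set) and report an error (or fall back to poll()) when it returns 0.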
>
>
> /David
>
>
>
> On Tue, Mar 12, 2013 at 11:30 AM, Hongjia Cao <[email protected]>
> wrote:
> When launching tasks on about 1000 nodes, I sometimes get the
> following error message:
>
> srun: error: io_init_msg_read timed out
> srun: error: failed reading io init message
>
> I traced the problem to src/common/fd.c, where select() is used to
> check whether a file descriptor is readable. Running the attached
> program tsel.c shows that on RHEL 6.2 a file descriptor passed to
> select() cannot exceed 1023, or FD_ISSET() will not work correctly:
>
> [root@ln0 select]# cat /etc/issue
> Red Hat Enterprise Linux Server release 6.2 (Santiago)
> Kernel \r on an \m
>
> [root@ln0 select]# uname -a
> Linux ln0 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
>
> [root@ln0 select]# ./tsel 1023
> dup2 returned 1023
> file descriptor 1023 in readable set
> file descriptor 1023 in exception set
>
> select returned 1
> file descriptor 1023 readable
>
> [root@ln0 select]# ./tsel 1024
> dup2 returned 1024
> file descriptor 1024 in readable set
> file descriptor 1024 in exception set
>
> select returned 1
>
> [root@ln0 select]# ./tsel 1027
> dup2 returned 1027
> file descriptor 1027 in readable set
> file descriptor 1027 in exception set
> select returned -1
> failed to select:: Bad file descriptor
>
> I changed "select()" to "poll()" to fix this problem.