On 22.04.2014 at 11:11, Sve N wrote:

> Thanks for your answer, Reuti.
> The spool directory is a local folder, which exists and can be used. To 
> confirm this, I just tested with the "KEEP_ACTIVE" parameter set - 
> interestingly, the error did not occur before the fourth of the small jobs, 
> which indicates that time (or load, or similar) is probably involved. When 
> comparing the files, those of the finished jobs look very similar; for the 
> failing job only 'config', 'environment' and the 'pe_hostfile' were created, 
> but they seemed normal.
> 
> The binaries are all the same; they're not on an NFS mount, but copies of 
> each other - except the ones on one test node, of course, which I compiled 
> myself to see whether this solves my problem.
> 
> And there is only one execd running - "process" was the wrong word, maybe 
> "threads"? If I run 'ps axu | grep execd' I only find one daemon, but if I 
> 'strace -ff -p' it, I get the output of five processes/threads. I would 
> have liked to start execd in some single-thread mode, since I think 
> parallelism can be a source of seemingly random segfaults, but I didn't find 
> an option.
> 
> I might be wrong there and I'm willing to test those things, but I don't 
> really think my GE configuration causes the problem: all I did was update 
> SUSE Linux Enterprise Server 11.1 to SUSE Linux Enterprise Server 11.3 on 
> the execution hosts only; I didn't touch the SGE folders at all. And 
> secondly, the only "error" I can read out of the straces is a segfault, 
> which normally shouldn't be caused by a wrong configuration file anyway...

Can you check which libraries the binaries use? Maybe they are different 
compared to the ones on the head node where you still run 11.1, and it could 
give you a hint as to what is causing it.
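Something like the following could do the comparison - a sketch only, assuming the binaries live under /opt/sge as in your strace output (adjust $SGE_ROOT for your installation):

```shell
# Record the shared libraries the Grid Engine daemons resolve on this host.
# Run the same commands on an SLES 11.1 host and an SLES 11.3 host, then
# diff the resulting lists. Paths are assumptions; adjust for your install.
SGE_ROOT=${SGE_ROOT:-/opt/sge}
BIN="$SGE_ROOT/bin/linux-x64"
for b in sge_execd sge_shepherd; do
    if [ -x "$BIN/$b" ]; then
        # one sorted library list per binary and host, easy to diff later
        ldd "$BIN/$b" | awk '{print $1}' | sort \
            > "/tmp/${b}-libs-$(hostname).txt"
    fi
done
# Afterwards compare the lists from two hosts, e.g.:
#   diff /tmp/sge_execd-libs-headnode.txt /tmp/sge_execd-libs-host-4.txt
```

Given the libldap/libnss_ldap frames in your backtrace, the LDAP and NSS libraries would be the first ones I'd look at in the diff.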

-- Reuti


> Sven
> 
> > Date: Thu, 17 Apr 2014 17:11:47 +0200
> > From: Reuti <[email protected]>
> > To: Sve N <[email protected]>
> > Cc: "[email protected]" <[email protected]>
> > Subject: Re: [gridengine users] GE2011.11p1 on SLES 11.3 - execution
> > host in error state
> > Message-ID:
> > <[email protected]>
> > Content-Type: text/plain; charset=us-ascii
> > 
> > Hi,
> > 
> > On 17.04.2014 at 16:30, Sve N wrote:
> > 
> > > We have been using Open Grid Engine for some time now on our Linux 
> > > machines, which were running SUSE Linux Enterprise Server 11.1. I 
> > > recently updated some of them to SLES 11.3 (some were just patched, and 
> > > some had a fresh install), and since then gridengine has shown some 
> > > faulty behavior:
> > > 
> > > The first job submitted to an execution host runs and finishes correctly, 
> > > but if one submits a second one, the host instantly switches into an 
> > > error state, leaving the second job as 'qw'. It seems as if there is a 
> > > very small time window (< ~1 s) in which a second job can be submitted 
> > > after the first one, but anything later, independent of whether the first 
> > > one is still running or not, results in the error.
> > > To be able to run the next job, one has to stop and start 
> > > /etc/init.d/sgeexecd.
> > > 
> > > The messages file of the spool directory says:
> > > ___________________________________________________________________________________________________
> > > 04/17/2014 11:40:15| main|host-4|I|controlled shutdown 2011.11
> > > 04/17/2014 11:40:22| main|host-4|I|starting up OGS/GE 2011.11 (linux-x64)
> > > 04/17/2014 11:41:28| main|host-4|E|shepherd of job 4758.1 died through 
> > > signal = 11
> > > 04/17/2014 11:41:28| main|host-4|E|abnormal termination of shepherd for 
> > > job 4758.1: no "exit_status" file
> > > 04/17/2014 11:41:28| main|host-4|E|can't open file 
> > > active_jobs/4758.1/error: Datei oder Verzeichnis nicht gefunden [i.e. 
> > > file or directory not found]
> > 
> > What is the location of the spool directory on the exechosts? Does it live 
> > in the NFS location where you are sharing the binaries, or does it go to a 
> > local place like /var/spool/sge? Maybe this needs to be created.
> > 
> > 
> > > 04/17/2014 11:41:28| main|host-4|E|can't open pid file 
> > > "active_jobs/4758.1/pid" for job 4758.1
> > > ___________________________________________________________________________________________________
> > > 
> > > Here 4758 is the second job. The signal is mostly 11, sometimes 6; I 
> > > don't know how to influence this. I used strace on the execd to maybe 
> > > get a clue. The output for the newly started process, which is invoked 
> > > for the second job, contained this:
> > > ___________________________________________________________________________________________________
> > > set_robust_list(0x7f0c9366c9e0, 0x18) = 0
> > > getsockname(3, {sa_family=AF_INET, sin_port=htons(60960), 
> > > sin_addr=inet_addr("1.2.3.4")}, [16]) = 0
> > > getpeername(3, {sa_family=AF_INET, sin_port=htons(389), 
> > > sin_addr=inet_addr("5.6.7.8")}, [16]) = 0
> > > fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
> > > dup(3) = 7
> > > fcntl(7, F_SETFD, FD_CLOEXEC) = 0
> > > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> > > fcntl(8, F_GETFD) = 0
> > > dup2(8, 3) = 3
> > > fcntl(3, F_SETFD, 0) = 0
> > > close(8) = 0
> > > --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> > > ___________________________________________________________________________________________________
> > > 
> > > Since there is a segmentation fault, I thought that maybe some libraries 
> > > changed in the new SUSE version, so I compiled gridengine on one of the 
> > > new machines. Since I probably don't need everything (they are only used 
> > > as execution hosts), I used ./aimk -only-core -no-jni -no-java. With some
> > 
> > All machines should use the same binaries. Do you run different versions 
> > on different machines in the cluster?
> > 
> > 
> > > tinkering it finally worked, up to and including the creation of the 
> > > local distribution. But the install_execd script complained that qmake, 
> > > qtcsh, rlogin, rsh and rshd were missing, so I just copied all the other 
> > > binaries, libraries and files to a host with an old gridengine version 
> > > installed. Unfortunately this didn't solve the problem.
> > > The strace output of the new process now looks a bit different:
> > > ___________________________________________________________________________________________________
> > > set_robust_list(0x7ffe451639e0, 0x18) = 0
> > > getsockname(3, {sa_family=AF_INET, sin_port=htons(51974), 
> > > sin_addr=inet_addr("1.2.3.5")}, [16]) = 0
> > > getpeername(3, {sa_family=AF_INET, sin_port=htons(389), 
> > > sin_addr=inet_addr("5.6.7.8")}, [16]) = 0
> > > fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
> > > dup(3) = 7
> > > fcntl(7, F_SETFD, FD_CLOEXEC) = 0
> > > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> > > fcntl(8, F_GETFD) = 0
> > > dup2(8, 3) = 3
> > > fcntl(3, F_SETFD, 0) = 0
> > > close(8) = 0
> > > open("/dev/tty", O_RDWR|O_NOCTTY|O_NONBLOCK) = -1 ENXIO (No such device 
> > > or address)
> > > writev(2, [{"*** glibc detected *** ", 23}, 
> > > {"/opt/sge/bin/linux-x64/sge_execd", 32}, {": ", 2}, {"free(): invalid 
> > > pointer", 23}, {": 0x", 4}, {"00007ffe44425188", 16}, {" ***\n", 5}], 7) 
> > > = 105
> > > open("/opt/sge/bin/linux-x64/../../lib/linux-x64/libgcc_s.so.1", 
> > > O_RDONLY) = -1 ENOENT (No such file or directory)
> > > open("/opt/sge/lib/linux-x64/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No 
> > > such file or directory)
> > > open("/etc/ld.so.cache", O_RDONLY) = 8
> > > fstat(8, {st_mode=S_IFREG|0644, st_size=50062, ...}) = 0
> > > mmap(NULL, 50062, PROT_READ, MAP_PRIVATE, 8, 0) = 0x7ffe45136000
> > > close(8) = 0
> > > open("/lib64/libgcc_s.so.1", O_RDONLY) = 8
> > > read(8, 
> > > "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200.\0\0\0\0\0\0"..., 
> > > 832) = 832
> > > fstat(8, {st_mode=S_IFREG|0755, st_size=88552, ...}) = 0
> > > mmap(NULL, 2184216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 8, 0) 
> > > = 0x7ffe3fe06000
> > > fadvise64(8, 0, 2184216, POSIX_FADV_WILLNEED) = 0
> > > mprotect(0x7ffe3fe1b000, 2093056, PROT_NONE) = 0
> > > mmap(0x7ffe4001a000, 8192, PROT_READ|PROT_WRITE, 
> > > MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 8, 0x14000) = 0x7ffe4001a000
> > > close(8) = 0
> > > mprotect(0x7ffe4001a000, 4096, PROT_READ) = 0
> > > munmap(0x7ffe45136000, 50062) = 0
> > > futex(0x7ffe448ba610, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> > > futex(0x7ffe4001b1a4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> > > write(2, "======= Backtrace: =========\n", 29) = 29
> > > writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"+0x", 3}, {"76618", 5}, 
> > > {")", 1}, {"[0x", 3}, {"7ffe445ba618", 12}, {"]\n", 2}], 8) = 43
> > > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, 
> > > {"ldap_free_urldesc", 17}, {"+0x", 3}, {"19", 2}, {")", 1}, {"[0x", 3}, 
> > > {"7ffe435ae449", 12}, {"]\n", 2}], 9) = 68
> > > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, 
> > > {"ldap_free_urllist", 17}, {"+0x", 3}, {"18", 2}, {")", 1}, {"[0x", 3}, 
> > > {"7ffe435ae4c8", 12}, {"]\n", 2}], 9) = 68
> > > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, 
> > > {"ldap_free_connection", 20}, {"+0x", 3}, {"132", 3}, {")", 1}, {"[0x", 
> > > 3}, {"7ffe435aaed2", 12}, {"]\n", 2}], 9) = 72
> > > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, 
> > > {"ldap_ld_free", 12}, {"+0x", 3}, {"b7", 2}, {")", 1}, {"[0x", 3}, 
> > > {"7ffe435a1d77", 12}, {"]\n", 2}], 9) = 63
> > > writev(2, [{"/lib64/libnss_ldap.so.2", 23}, {"(", 1}, {"+0x", 3}, 
> > > {"4047", 4}, {")", 1}, {"[0x", 3}, {"7ffe437d5047", 12}, {"]\n", 2}], 8) 
> > > = 49
> > > writev(2, [{"/lib64/libnss_ldap.so.2", 23}, {"(", 1}, {"+0x", 3}, 
> > > {"7ad5", 4}, {")", 1}, {"[0x", 3}, {"7ffe437d8ad5", 12}, {"]\n", 2}], 8) 
> > > = 49
> > > writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_fork", 11}, 
> > > {"+0x", 3}, {"1df", 3}, {")", 1}, {"[0x", 3}, {"7ffe445ec87f", 12}, 
> > > {"]\n", 2}], 9) = 52
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, 
> > > {"sge_exec_job", 12}, {"+0x", 3}, {"5a05", 4}, {")", 1}, {"[0x", 3}, 
> > > {"4332f5", 6}, {"]\n", 2}], 9) = 64
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"[0x", 3}, 
> > > {"435293", 6}, {"]\n", 2}], 4) = 43
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, 
> > > {"do_ck_to_do", 11}, {"+0x", 3}, {"286", 3}, {")", 1}, {"[0x", 3}, 
> > > {"435906", 6}, {"]\n", 2}], 9) = 62
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, 
> > > {"sge_execd_process_messages", 26}, {"+0x", 3}, {"43c", 3}, {")", 1}, 
> > > {"[0x", 3}, {"42cd7c", 6}, {"]\n", 2}], 9) = 77
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, {"main", 
> > > 4}, {"+0x", 3}, {"b14", 3}, {")", 1}, {"[0x", 3}, {"429ed4", 6}, {"]\n", 
> > > 2}], 9) = 55
> > > writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_start_main", 17}, 
> > > {"+0x", 3}, {"e6", 2}, {")", 1}, {"[0x", 3}, {"7ffe44562c36", 12}, 
> > > {"]\n", 2}], 9) = 57
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, 
> > > {"setlocale", 9}, {"+0x", 3}, {"1f9", 3}, {")", 1}, {"[0x", 3}, 
> > > {"428cd9", 6}, {"]\n", 2}], 9) = 60
> > > write(2, "======= Memory map: ========\n", 29) = 29
> > > open("/proc/self/maps", O_RDONLY) = 8
> > > read(8, "00400000-005b1000 r-xp 00000000 "..., 1024) = 1024
> > > write(2, "00400000-005b1000 r-xp 00000000 "..., 1024) = 1024
> > > read(8, " /lib64/libz.so.1.2.7\n"..., 1024) = 1024
> > > write(2, " /lib64/libz.so.1.2.7\n"..., 1024) = 1024
> > > read(8, "0.1\n7ffe41e40000-7ffe41e41000 rw"..., 1024) = 1024
> > > write(2, "0.1\n7ffe41e40000-7ffe41e41000 rw"..., 1024) = 1024
> > > read(8, "03:01 2611268 "..., 1024) = 1024
> > > write(2, "03:01 2611268 "..., 1024) = 1024
> > > read(8, "000 r--p 00014000 103:01 1733331"..., 1024) = 1024
> > > write(2, "000 r--p 00014000 103:01 1733331"..., 1024) = 1024
> > > read(8, "m_err.so.2.1\n7ffe42ebc000-7ffe42"..., 1024) = 1024
> > > write(2, "m_err.so.2.1\n7ffe42ebc000-7ffe42"..., 1024) = 1024
> > > read(8, ".2.7.1\n7ffe43586000-7ffe43587000"..., 1024) = 1024
> > > write(2, ".2.7.1\n7ffe43586000-7ffe43587000"..., 1024) = 1024
> > > read(8, "\n7ffe439e6000-7ffe439f2000 rw-p "..., 1024) = 1024
> > > write(2, "\n7ffe439e6000-7ffe439f2000 rw-p "..., 1024) = 1024
> > > read(8, "0000 00:00 0 \n7ffe448bc000-7ffe4"..., 1024) = 1024
> > > write(2, "0000 00:00 0 \n7ffe448bc000-7ffe4"..., 1024) = 1024
> > > read(8, "ibdl-2.11.3.so\n7ffe44f54000-7ffe"..., 1024) = 1024
> > > write(2, "ibdl-2.11.3.so\n7ffe44f54000-7ffe"..., 1024) = 1024
> > > read(8, " 00:00 0 "..., 1024) = 206
> > > write(2, " 00:00 0 "..., 206) = 206
> > > read(8, "", 1024) = 0
> > > close(8) = 0
> > > rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
> > > tgkill(13089, 13089, SIGABRT) = 0
> > > --- SIGABRT (Aborted) @ 0 (0) ---
> > > ___________________________________________________________________________________________________
> > > 
> > > I can't interpret this well enough to know what went wrong (if it's in 
> > > there in the first place). The strace output of the five execd processes 
> > > running
> > 
> > There should only be one execd per host. If there is still an old one 
> > running, maybe it's best to reboot the machine.
> > 
> > -- Reuti
> > 
> > 
> > > constantly in the background is too long. The one probably managing the 
> > > job didn't look very different when comparing a working and a 
> > > non-working submission (first and second job); the first real difference 
> > > is the last two lines of this excerpt, the rest is only slightly 
> > > differing numbers etc.:
> > > ___________________________________________________________________________________________________
> > > [...]
> > > stat("/proc/6/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > open("/proc/6/status", O_RDONLY) = 8
> > > fstat(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
> > > = 0x7ffe45169000
> > > read(8, "Name:\tmigration/0\nState:\tS (slee"..., 1024) = 799
> > > close(8) = 0
> > > munmap(0x7ffe45169000, 4096) = 0
> > > close(8) = -1 EBADF (Bad file descriptor)
> > > stat("/proc/7/stat", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > open("/proc/7/stat", O_RDONLY) = 8
> > > read(8, "7 (watchdog/0) S 2 0 0 0 -1 2216"..., 1023) = 164
> > > close(8) = 0
> > > stat("/proc/7/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > open("/proc/7/status", O_RDONLY) = 8
> > > fstat(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
> > > = 0x7ffe45169000
> > > read(8, "Name:\twatchdog/0\nState:\tS (sleep"..., 1024) = 800
> > > close(8) = 0
> > > munmap(0x7ffe45169000, 4096) = 0
> > > close(8) = -1 EBADF (Bad file descriptor)
> > > stat("/proc/8/stat", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > open("/proc/8/stat", O_RDONLY) = 8
> > > read(8, "8 (migration/1) S 2 0 0 0 -1 221"..., 1023) = 164
> > > close(8) = 0
> > > --- SIGCHLD (Child exited) @ 0 (0) ---
> > > rt_sigreturn(0x11) = 0
> > > ___________________________________________________________________________________________________
> > > 
> > > I'd be happy about any suggestion on how to solve this, or just where I 
> > > could continue searching for the root of the problem.
> > > Thanks, Sven
> > > _______________________________________________
> > > users mailing list
> > > [email protected]
> > > https://gridengine.org/mailman/listinfo/users
> 

