On 22.04.2014 at 11:11, Sve N wrote:

> thanks for your answer, Reuti:
>
> The spool directory is a local folder, which exists and can be used (to
> confirm this I just tested with the "KEEP_ACTIVE" parameter set -
> interestingly, the error did not occur before the fourth of the small
> jobs, which indicates that time (or load, or similar) is probably
> involved. When comparing the files, those of the finished jobs look very
> similar; for the error job only 'config', 'environment' and the
> 'pe_hostfile' were created, but they seemed normal).
>
> The binaries are all the same; they're not on an NFS mount, but copies of
> each other - except the ones on one test node, of course, which I
> compiled myself to see whether this solves my problem.
>
> And there is only one execd running - "process" was the wrong word, maybe
> "threads"? If I 'ps axu | grep execd' I only find one daemon, but if I
> 'strace -ff -p' it, I get the output of five processes/threads. I would
> have liked to start execd in some single-threaded mode, since I think
> parallelism can be a source of seemingly random segfaults, but I didn't
> find an option for that.
>
> I might be wrong there and I'm willing to test those things, but I don't
> really think that my GE configuration causes the problem, since all I did
> was update only the execution hosts from SUSE Linux Enterprise Server
> 11.1 to SUSE Linux Enterprise Server 11.3; I didn't touch the SGE folders
> at all. And secondly, the only "error" I can read out of the straces is a
> segfault, which normally shouldn't be caused by a wrong configuration
> file either...
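[Editor's note: the process-vs-thread question above can be settled without strace. A minimal sketch, assuming a Linux /proc filesystem; the current shell ($$) stands in for the execd PID, which you would substitute with the value from ps:]

```shell
# Confirm that a "multi-process" daemon is really one process with several
# threads. $$ (the current shell) is a stand-in here - substitute the PID
# reported by 'ps axu | grep execd'.
pid=$$

# Every thread of a process appears as a directory under /proc/<pid>/task,
# so counting entries there shows the thread count:
echo "threads: $(ls /proc/$pid/task | wc -l)"

# ps can list the threads directly as well (LWP = thread id,
# NLWP = total thread count):
ps -L -p "$pid" -o pid,lwp,nlwp,cmd
```

If the five strace outputs all belong to task directories of one PID, there is indeed a single execd with five threads, not five daemons.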
Can you check which libraries the binaries use? Maybe they are different
compared to the ones on the head node where you still run 11.1, and it could
give you a hint what is causing it.

-- Reuti

> Sven
>
> > Date: Thu, 17 Apr 2014 17:11:47 +0200
> > From: Reuti <[email protected]>
> > To: Sve N <[email protected]>
> > Cc: "[email protected]" <[email protected]>
> > Subject: Re: [gridengine users] GE2011.11p1 on SLES 11.3 - execution
> >         host in error state
> > Message-ID: <[email protected]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Hi,
> >
> > On 17.04.2014 at 16:30, Sve N wrote:
> >
> > > we have been using Open Grid Engine for some time now on our Linux
> > > machines, which were running SUSE Linux Enterprise Server 11.1. I
> > > recently updated some of them to SLES 11.3 (some were just patched,
> > > and some had a fresh install), and since then gridengine shows some
> > > faulty behavior:
> > >
> > > The first job submitted to an execution host runs and finishes
> > > correctly, but if one submits a second one, the host switches into an
> > > error state instantly, leaving the second job as 'qw'. It seems as if
> > > there is a very small time window (< ~1 s) in which a second job can
> > > be submitted after the first one, but anything later, independent of
> > > whether the first one is still running or not, results in the error.
> > > To be able to run the next job, one has to stop and start
> > > /etc/init.d/sgeexecd.
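[Editor's note: Reuti's library comparison can be sketched like this. The sge_execd path is taken from the strace output later in the thread; everything else is generic:]

```shell
# Dump the resolved shared-library list of a binary so it can be diffed
# between hosts. BIN defaults to the sge_execd path seen in this thread;
# override it for other binaries.
BIN=${BIN:-/opt/sge/bin/linux-x64/sge_execd}

# Keep only library name and resolved path, one file per host:
ldd "$BIN" | awk '{print $1, $3}' | sort > "/tmp/libs.$(uname -n)"

# Then collect the per-host files on one machine and compare, e.g.:
#   diff /tmp/libs.node-sles113 /tmp/libs.node-sles111
# Version differences in libc, libldap or libnss_ldap would be the first
# suspects given the backtrace that appears later in the thread.
```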
> > >
> > > The messages file of the spool directory says:
> > > ___________________________________________________________________________________________________
> > > 04/17/2014 11:40:15| main|host-4|I|controlled shutdown 2011.11
> > > 04/17/2014 11:40:22| main|host-4|I|starting up OGS/GE 2011.11 (linux-x64)
> > > 04/17/2014 11:41:28| main|host-4|E|shepherd of job 4758.1 died through signal = 11
> > > 04/17/2014 11:41:28| main|host-4|E|abnormal termination of shepherd for job 4758.1: no "exit_status" file
> > > 04/17/2014 11:41:28| main|host-4|E|can't open file active_jobs/4758.1/error: Datei oder Verzeichnis nicht gefunden [German strerror text: "No such file or directory"]
> >
> > What is the location of the spool directory on the exec hosts? Does it
> > live in the NFS location where you are sharing the binaries, or does it
> > go to a local place like /var/spool/sge? Maybe this needs to be created.
> >
> > > 04/17/2014 11:41:28| main|host-4|E|can't open pid file "active_jobs/4758.1/pid" for job 4758.1
> > > ___________________________________________________________________________________________________
> > >
> > > Where 4758 is the second job. The signal is mostly 11, sometimes 6; I
> > > don't know how to influence this. I used strace on the execd to maybe
> > > get a clue.
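[Editor's note: Reuti's spool-location question can be checked on the exec host itself. A minimal sketch; /var/spool/sge is only an assumed default, the real value comes from the cluster configuration (e.g. `qconf -sconf | grep execd_spool_dir` on a host with SGE client tools):]

```shell
# Check that the local execd spool directory exists and is writable.
# SPOOL is an assumed default - substitute the execd_spool_dir value
# from your cluster configuration.
SPOOL=${SPOOL:-/var/spool/sge}

if [ -d "$SPOOL" ] && [ -w "$SPOOL" ]; then
    echo "spool dir $SPOOL exists and is writable"
else
    echo "spool dir $SPOOL missing or not writable" >&2
fi
```

The "can't open file active_jobs/..." errors in the log are consistent with the shepherd dying before it could populate the job's active_jobs directory, so a missing spool directory is only one of the possible causes.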
> > > The output for the newly started process, which is invoked for the
> > > second job, contained this:
> > > ___________________________________________________________________________________________________
> > > set_robust_list(0x7f0c9366c9e0, 0x18) = 0
> > > getsockname(3, {sa_family=AF_INET, sin_port=htons(60960), sin_addr=inet_addr("1.2.3.4")}, [16]) = 0
> > > getpeername(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("5.6.7.8")}, [16]) = 0
> > > fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
> > > dup(3) = 7
> > > fcntl(7, F_SETFD, FD_CLOEXEC) = 0
> > > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> > > fcntl(8, F_GETFD) = 0
> > > dup2(8, 3) = 3
> > > fcntl(3, F_SETFD, 0) = 0
> > > close(8) = 0
> > > --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> > > ___________________________________________________________________________________________________
> > >
> > > Since there is a segmentation fault, I thought that maybe some
> > > libraries changed on the new SUSE version, so I compiled gridengine on
> > > one of the new machines. Since I probably don't need everything (they
> > > are only used as execution hosts), I used
> > > ./aimk -only-core -no-jni -no-java. With some
> >
> > All machines should use the same binaries. Do you run different versions
> > on different machines in the cluster?
> >
> > > tinkering it finally worked, up to and including the creation of the
> > > local distribution. But the install_execd script complained that
> > > qmake, qtcsh, rlogin, rsh and rshd are missing. So I just copied all
> > > the other binaries, libraries and files to a host with an old
> > > gridengine version installed. Unfortunately this didn't solve the
> > > problem.
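[Editor's note: Reuti's point that all machines should use the same binaries can be verified by checksumming. A sketch under assumptions: host names are placeholders, the binary path is the one from the strace output, and passwordless ssh to the nodes is assumed:]

```shell
# Checksum sge_execd on every exec host and count distinct results.
# host-1..host-4 are placeholder node names.
hosts="host-1 host-2 host-3 host-4"

for h in $hosts; do
    ssh -o ConnectTimeout=5 "$h" md5sum /opt/sge/bin/linux-x64/sge_execd 2>/dev/null
done | awk '{print $1}' | sort | uniq -c

# More than one distinct checksum line means the supposedly identical
# copies have diverged - e.g. the self-compiled binaries on the test node.
```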
> > > The output of strace of the new process now looks a bit different:
> > > ___________________________________________________________________________________________________
> > > set_robust_list(0x7ffe451639e0, 0x18) = 0
> > > getsockname(3, {sa_family=AF_INET, sin_port=htons(51974), sin_addr=inet_addr("1.2.3.5")}, [16]) = 0
> > > getpeername(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("5.6.7.8")}, [16]) = 0
> > > fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
> > > dup(3) = 7
> > > fcntl(7, F_SETFD, FD_CLOEXEC) = 0
> > > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> > > fcntl(8, F_GETFD) = 0
> > > dup2(8, 3) = 3
> > > fcntl(3, F_SETFD, 0) = 0
> > > close(8) = 0
> > > open("/dev/tty", O_RDWR|O_NOCTTY|O_NONBLOCK) = -1 ENXIO (No such device or address)
> > > writev(2, [{"*** glibc detected *** ", 23}, {"/opt/sge/bin/linux-x64/sge_execd", 32}, {": ", 2}, {"free(): invalid pointer", 23}, {": 0x", 4}, {"00007ffe44425188", 16}, {" ***\n", 5}], 7) = 105
> > > open("/opt/sge/bin/linux-x64/../../lib/linux-x64/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
> > > open("/opt/sge/lib/linux-x64/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
> > > open("/etc/ld.so.cache", O_RDONLY) = 8
> > > fstat(8, {st_mode=S_IFREG|0644, st_size=50062, ...}) = 0
> > > mmap(NULL, 50062, PROT_READ, MAP_PRIVATE, 8, 0) = 0x7ffe45136000
> > > close(8) = 0
> > > open("/lib64/libgcc_s.so.1", O_RDONLY) = 8
> > > read(8, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200.\0\0\0\0\0\0"..., 832) = 832
> > > fstat(8, {st_mode=S_IFREG|0755, st_size=88552, ...}) = 0
> > > mmap(NULL, 2184216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 8, 0) = 0x7ffe3fe06000
> > > fadvise64(8, 0, 2184216, POSIX_FADV_WILLNEED) = 0
> > > mprotect(0x7ffe3fe1b000, 2093056, PROT_NONE) = 0
> > > mmap(0x7ffe4001a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 8, 0x14000) = 0x7ffe4001a000
> > > close(8) = 0
> > > mprotect(0x7ffe4001a000, 4096, PROT_READ) = 0
> > > munmap(0x7ffe45136000, 50062) = 0
> > > futex(0x7ffe448ba610, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> > > futex(0x7ffe4001b1a4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> > > write(2, "======= Backtrace: =========\n", 29) = 29
> > > writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"+0x", 3}, {"76618", 5}, {")", 1}, {"[0x", 3}, {"7ffe445ba618", 12}, {"]\n", 2}], 8) = 43
> > > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, {"ldap_free_urldesc", 17}, {"+0x", 3}, {"19", 2}, {")", 1}, {"[0x", 3}, {"7ffe435ae449", 12}, {"]\n", 2}], 9) = 68
> > > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, {"ldap_free_urllist", 17}, {"+0x", 3}, {"18", 2}, {")", 1}, {"[0x", 3}, {"7ffe435ae4c8", 12}, {"]\n", 2}], 9) = 68
> > > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, {"ldap_free_connection", 20}, {"+0x", 3}, {"132", 3}, {")", 1}, {"[0x", 3}, {"7ffe435aaed2", 12}, {"]\n", 2}], 9) = 72
> > > writev(2, [{"/usr/lib64/libldap-2.4.so.2", 27}, {"(", 1}, {"ldap_ld_free", 12}, {"+0x", 3}, {"b7", 2}, {")", 1}, {"[0x", 3}, {"7ffe435a1d77", 12}, {"]\n", 2}], 9) = 63
> > > writev(2, [{"/lib64/libnss_ldap.so.2", 23}, {"(", 1}, {"+0x", 3}, {"4047", 4}, {")", 1}, {"[0x", 3}, {"7ffe437d5047", 12}, {"]\n", 2}], 8) = 49
> > > writev(2, [{"/lib64/libnss_ldap.so.2", 23}, {"(", 1}, {"+0x", 3}, {"7ad5", 4}, {")", 1}, {"[0x", 3}, {"7ffe437d8ad5", 12}, {"]\n", 2}], 8) = 49
> > > writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_fork", 11}, {"+0x", 3}, {"1df", 3}, {")", 1}, {"[0x", 3}, {"7ffe445ec87f", 12}, {"]\n", 2}], 9) = 52
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, {"sge_exec_job", 12}, {"+0x", 3}, {"5a05", 4}, {")", 1}, {"[0x", 3}, {"4332f5", 6}, {"]\n", 2}], 9) = 64
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"[0x", 3}, {"435293", 6}, {"]\n", 2}], 4) = 43
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, {"do_ck_to_do", 11}, {"+0x", 3}, {"286", 3}, {")", 1}, {"[0x", 3}, {"435906", 6}, {"]\n", 2}], 9) = 62
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, {"sge_execd_process_messages", 26}, {"+0x", 3}, {"43c", 3}, {")", 1}, {"[0x", 3}, {"42cd7c", 6}, {"]\n", 2}], 9) = 77
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, {"main", 4}, {"+0x", 3}, {"b14", 3}, {")", 1}, {"[0x", 3}, {"429ed4", 6}, {"]\n", 2}], 9) = 55
> > > writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"__libc_start_main", 17}, {"+0x", 3}, {"e6", 2}, {")", 1}, {"[0x", 3}, {"7ffe44562c36", 12}, {"]\n", 2}], 9) = 57
> > > writev(2, [{"/opt/sge/bin/linux-x64/sge_execd", 32}, {"(", 1}, {"setlocale", 9}, {"+0x", 3}, {"1f9", 3}, {")", 1}, {"[0x", 3}, {"428cd9", 6}, {"]\n", 2}], 9) = 60
> > > write(2, "======= Memory map: ========\n", 29) = 29
> > > open("/proc/self/maps", O_RDONLY) = 8
> > > read(8, "00400000-005b1000 r-xp 00000000 "..., 1024) = 1024
> > > write(2, "00400000-005b1000 r-xp 00000000 "..., 1024) = 1024
> > > read(8, " /lib64/libz.so.1.2.7\n"..., 1024) = 1024
> > > write(2, " /lib64/libz.so.1.2.7\n"..., 1024) = 1024
> > > read(8, "0.1\n7ffe41e40000-7ffe41e41000 rw"..., 1024) = 1024
> > > write(2, "0.1\n7ffe41e40000-7ffe41e41000 rw"..., 1024) = 1024
> > > read(8, "03:01 2611268 "..., 1024) = 1024
> > > write(2, "03:01 2611268 "..., 1024) = 1024
> > > read(8, "000 r--p 00014000 103:01 1733331"..., 1024) = 1024
> > > write(2, "000 r--p 00014000 103:01 1733331"..., 1024) = 1024
> > > read(8, "m_err.so.2.1\n7ffe42ebc000-7ffe42"..., 1024) = 1024
> > > write(2, "m_err.so.2.1\n7ffe42ebc000-7ffe42"..., 1024) = 1024
> > > read(8, ".2.7.1\n7ffe43586000-7ffe43587000"..., 1024) = 1024
> > > write(2, ".2.7.1\n7ffe43586000-7ffe43587000"..., 1024) = 1024
> > > read(8, "\n7ffe439e6000-7ffe439f2000 rw-p "..., 1024) = 1024
> > > write(2, "\n7ffe439e6000-7ffe439f2000 rw-p "..., 1024) = 1024
> > > read(8, "0000 00:00 0 \n7ffe448bc000-7ffe4"..., 1024) = 1024
> > > write(2, "0000 00:00 0 \n7ffe448bc000-7ffe4"..., 1024) = 1024
> > > read(8, "ibdl-2.11.3.so\n7ffe44f54000-7ffe"..., 1024) = 1024
> > > write(2, "ibdl-2.11.3.so\n7ffe44f54000-7ffe"..., 1024) = 1024
> > > read(8, " 00:00 0 "..., 1024) = 206
> > > write(2, " 00:00 0 "..., 206) = 206
> > > read(8, "", 1024) = 0
> > > close(8) = 0
> > > rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
> > > tgkill(13089, 13089, SIGABRT) = 0
> > > --- SIGABRT (Aborted) @ 0 (0) ---
> > > ___________________________________________________________________________________________________
> > >
> > > I can't interpret this well enough to know what went wrong (if the
> > > cause is in it in the first place). The strace output of the five
> > > execd processes running
> >
> > There should only be one execd per host. If there is still an old one
> > running, maybe it's best to reboot the machine.
> >
> > -- Reuti
> >
> > > constantly in the background is too long. The one probably managing
> > > the job didn't look very different when comparing a working and a
> > > non-working submission (first and second job); the first real
> > > difference is the last two lines of this excerpt, the rest is only
> > > slightly differing numbers etc.:
> > > ___________________________________________________________________________________________________
> > > [...]
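[Editor's note: the backtrace above runs from __libc_fork through libnss_ldap into libldap's free routines, i.e. the crash happens in an LDAP NSS module while execd forks, not in gridengine's own code. A quick hedged check whether name resolution on the exec host goes through LDAP at all (file paths assume a standard glibc setup; the rpm package names are placeholders that may differ on SLES):]

```shell
# If 'ldap' appears in the passwd/group lookup order, the nss_ldap module
# is loaded into every process that resolves user names - including
# sge_execd - and a buggy/mismatched nss_ldap or libldap can corrupt the
# heap exactly as in the backtrace above.
[ -f /etc/nsswitch.conf ] && grep -E '^(passwd|group):' /etc/nsswitch.conf

# If it does, compare the LDAP library package versions between an updated
# and a non-updated node (package names are illustrative):
#   rpm -qa | grep -Ei 'nss_ldap|openldap'
```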
> > > stat("/proc/6/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > open("/proc/6/status", O_RDONLY) = 8
> > > fstat(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffe45169000
> > > read(8, "Name:\tmigration/0\nState:\tS (slee"..., 1024) = 799
> > > close(8) = 0
> > > munmap(0x7ffe45169000, 4096) = 0
> > > close(8) = -1 EBADF (Bad file descriptor)
> > > stat("/proc/7/stat", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > open("/proc/7/stat", O_RDONLY) = 8
> > > read(8, "7 (watchdog/0) S 2 0 0 0 -1 2216"..., 1023) = 164
> > > close(8) = 0
> > > stat("/proc/7/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > open("/proc/7/status", O_RDONLY) = 8
> > > fstat(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffe45169000
> > > read(8, "Name:\twatchdog/0\nState:\tS (sleep"..., 1024) = 800
> > > close(8) = 0
> > > munmap(0x7ffe45169000, 4096) = 0
> > > close(8) = -1 EBADF (Bad file descriptor)
> > > stat("/proc/8/stat", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> > > open("/proc/8/stat", O_RDONLY) = 8
> > > read(8, "8 (migration/1) S 2 0 0 0 -1 221"..., 1023) = 164
> > > close(8) = 0
> > > --- SIGCHLD (Child exited) @ 0 (0) ---
> > > rt_sigreturn(0x11) = 0
> > > ___________________________________________________________________________________________________
> > >
> > > I'd be happy about any suggestion on how to solve this, or just where
> > > I could continue searching for the root of the problem.
> > > Thanks, Sven
> > > _______________________________________________
> > > users mailing list
> > > [email protected]
> > > https://gridengine.org/mailman/listinfo/users
