> > I've recently installed SoGE 8.1.2 on a multihomed qmaster.
> > It's a clean install, using the packages from the SoGE project
> > page. This is meant to replace the existing SGE 6.2u5
> > version we have running today.
>
> [It shouldn't have needed a new install -- you can do a live
> upgrade.]

Noted, we opted for this to have a side-by-side installation for
testing purposes. Now I'm kind of glad we did it like that.

> > We have an internal bind which xCAT manages.
>
> Why dnsmasq and bind (not that it should be relevant to the crash)?

Mainly for its ability to cache results for long periods of time in
the event of a failure. This keeps the cluster in normal operation for
the most part, even after there's a problem with the DNS server.

> > Once the qmaster starts up, I see this in the messages-file:
> >
> > 10/31/2012 09:48:11| main|srvname|W|local configuration srvname
> > not defined - using global configuration
>
> I'm confused. I thought it crashed immediately, or is the above with
> the NSS change? Otherwise, what are the last messages before the
> crash (preferably with log_Level info)?

My fault for not being clearer. It starts up just fine, and everything
I'd expect works just as our old qmaster did. After approx 2 minutes
it segfaults without anything new or even odd added to the messages
file. If I change NSS to files, it stops segfaulting altogether and we
can almost make use of it. I've included the log you requested at the
end of the mail.

> > As soon as I change back from pure files to "files dns" it takes
> > 2-3 minutes and the qmaster segfaults again.
>
> Do you mean qmaster runs for that long, or the init script waits that
> long for it? What do you get with and without dns in NSS and flushing
> the nscd cache from
>
> utilbin/lx-amd64/gethostbyname -all srvname

I tried as many options as I could think of to get differing results,
but the output never changed.

> > It might be worth noting that this host is an SGE 6.2u5 qmaster
> > usually, with the original configuration of the resolver, it works
> > without problems.
>
> Are the execds always the same version as the qmaster (although I'd
> expect something different if not)?

I tried differing versions, but the qmaster noted the mismatch in the
logs, and the segfaults then made me suspect an incompatibility, so
I've kept the execds at the same version as the qmaster since early in
the process.

messages-output below (I have stripped our internal domain from this
log):

11/01/2012 07:26:47| main|srvname|W|local configuration srvname not defined - using global configuration
11/01/2012 07:26:47| main|srvname|I|using "/var/spool/sge" for execd_spool_dir
11/01/2012 07:26:47| main|srvname|I|using "/bin/mail" for mailer
11/01/2012 07:26:47| main|srvname|I|using "/usr/bin/xterm" for xterm
11/01/2012 07:26:47| main|srvname|I|using "none" for load_sensor
11/01/2012 07:26:47| main|srvname|I|using "none" for prolog
11/01/2012 07:26:47| main|srvname|I|using "none" for epilog
11/01/2012 07:26:47| main|srvname|I|using "unix_behavior" for shell_start_mode
11/01/2012 07:26:47| main|srvname|I|using "sh,bash,ksh,csh,tcsh" for login_shells
11/01/2012 07:26:47| main|srvname|I|using "0" for min_uid
11/01/2012 07:26:47| main|srvname|I|using "0" for min_gid
11/01/2012 07:26:47| main|srvname|I|using "20000-20100" for gid_range
11/01/2012 07:26:47| main|srvname|I|using "00:00:40" for load_report_time
11/01/2012 07:26:47| main|srvname|I|using "false" for enforce_project
11/01/2012 07:26:47| main|srvname|I|using "auto" for enforce_user
11/01/2012 07:26:47| main|srvname|I|using "00:05:00" for max_unheard
11/01/2012 07:26:47| main|srvname|I|using "log_info" for loglevel
11/01/2012 07:26:47| main|srvname|I|using "none" for administrator_mail
11/01/2012 07:26:47| main|srvname|I|using "none" for set_token_cmd
11/01/2012 07:26:47| main|srvname|I|using "none" for pag_cmd
11/01/2012 07:26:47| main|srvname|I|using "none" for token_extend_time
11/01/2012 07:26:47| main|srvname|I|using "none" for shepherd_cmd
11/01/2012 07:26:47| main|srvname|I|using "none" for qmaster_params
11/01/2012 07:26:47| main|srvname|I|using "ENABLE_BINDING=true H_MEMORYLOCKED=1099509530624 S_MEMORYLOCKED=1099509530624" for execd_params
11/01/2012 07:26:47| main|srvname|I|using "accounting=true reporting=true flush_time=00:00:15 joblog=false sharelog=00:00:00" for reporting_params
11/01/2012 07:26:47| main|srvname|I|using "100" for finished_jobs
11/01/2012 07:26:47| main|srvname|I|using "builtin" for qlogin_daemon
11/01/2012 07:26:47| main|srvname|I|using "builtin" for qlogin_command
11/01/2012 07:26:47| main|srvname|I|using "builtin" for rsh_daemon
11/01/2012 07:26:47| main|srvname|I|using "builtin" for rsh_command
11/01/2012 07:26:47| main|srvname|I|using "none" for jsv_url
11/01/2012 07:26:47| main|srvname|I|using "ac,h,i,e,o,j,M,N,p,w" for jsv_allowed_mod
11/01/2012 07:26:47| main|srvname|I|using "builtin" for rlogin_daemon
11/01/2012 07:26:47| main|srvname|I|using "builtin" for rlogin_command
11/01/2012 07:26:47| main|srvname|I|using "00:00:00" for reschedule_unknown
11/01/2012 07:26:47| main|srvname|I|using "2000" for max_aj_instances
11/01/2012 07:26:47| main|srvname|I|using "75000" for max_aj_tasks
11/01/2012 07:26:47| main|srvname|I|using "0" for max_u_jobs
11/01/2012 07:26:47| main|srvname|I|using "0" for max_jobs
11/01/2012 07:26:47| main|srvname|I|using "0" for max_advance_reservations
11/01/2012 07:26:47| main|srvname|I|using "false" for reprioritize
11/01/2012 07:26:47| main|srvname|I|using "0" for auto_user_oticket
11/01/2012 07:26:47| main|srvname|I|using "0" for auto_user_fshare
11/01/2012 07:26:47| main|srvname|I|using "none" for auto_user_default_project
11/01/2012 07:26:47| main|srvname|I|using "86400" for auto_user_delete_time
11/01/2012 07:26:47| main|srvname|I|using "false" for delegated_file_staging
11/01/2012 07:26:47| main|srvname|I|using "" for libjvm_path
11/01/2012 07:26:47| main|srvname|I|using "" for additional_jvm_args
11/01/2012 07:26:47| main|srvname|I|read job database with 0 entries in 0 seconds
11/01/2012 07:26:47| main|srvname|W|nr of dynamic event clients exceeds max file descriptor limit, setting MAX_DYN_EC=979
11/01/2012 07:26:47| main|srvname|I|max dynamic event clients is set to 979
11/01/2012 07:26:47| main|srvname|I|qmaster hard descriptor limit is set to 1024
11/01/2012 07:26:47| main|srvname|I|qmaster soft descriptor limit is set to 1024
11/01/2012 07:26:47| main|srvname|I|qmaster will use max. 1004 file descriptors for communication
11/01/2012 07:26:47| main|srvname|I|qmaster will accept max. 979 dynamic event clients
11/01/2012 07:26:47| main|srvname|I|starting up SGE 8.1.2 (lx-amd64)
11/01/2012 07:26:48| main|srvname|I|2 worker threads are enabled
11/01/2012 07:26:48| main|srvname|I|2 listener threads are enabled
11/01/2012 07:26:48|schedu|srvname|I|"scheduler" registers as event client with id 1 event delivery interval 10
11/01/2012 07:26:48|schedu|srvname|I|sge_clab2dev@srvname added "scheduler" to event client list
11/01/2012 07:26:48|schedu|srvname|I|using "default" as algorithm
11/01/2012 07:26:48|schedu|srvname|I|using "0:0:30" for schedule_interval
11/01/2012 07:26:48|schedu|srvname|I|using "0:0:0" for load_adjustment_decay_time
11/01/2012 07:26:48|schedu|srvname|I|using "mem_total" for load_formula
11/01/2012 07:26:48|schedu|srvname|I|using "true" for schedd_job_info
11/01/2012 07:26:48|schedu|srvname|I|using param: "none"
11/01/2012 07:26:48|schedu|srvname|I|using "0:0:0" for reprioritize_interval
11/01/2012 07:26:48|schedu|srvname|I|using "cpu=0.75,mem=0.25,io=0" for usage_weight_list
11/01/2012 07:26:48|schedu|srvname|I|using "none" for halflife_decay_list
11/01/2012 07:26:48|schedu|srvname|I|using "OFS" for policy_hierarchy
11/01/2012 07:26:48|schedu|srvname|I|using "NONE" for job_load_adjustments
11/01/2012 07:26:48|schedu|srvname|I|using 0 for maxujobs
11/01/2012 07:26:48|schedu|srvname|I|using 0 for queue_sort_method
11/01/2012 07:26:48|schedu|srvname|I|using 1 for flush_submit_sec
11/01/2012 07:26:48|schedu|srvname|I|using 1 for flush_finish_sec
11/01/2012 07:26:48|schedu|srvname|I|using 144 for halftime
11/01/2012 07:26:48|schedu|srvname|I|using 5 for compensation_factor
11/01/2012 07:26:48|schedu|srvname|I|using 0.25 for weight_user
11/01/2012 07:26:48|schedu|srvname|I|using 0.25 for weight_project
11/01/2012 07:26:48|schedu|srvname|I|using 0.25 for weight_department
11/01/2012 07:26:48|schedu|srvname|I|using 0.25 for weight_job
11/01/2012 07:26:48|schedu|srvname|I|using 10000 for weight_tickets_functional
11/01/2012 07:26:48|schedu|srvname|I|using 100000 for weight_tickets_share
11/01/2012 07:26:48|schedu|srvname|I|using 1 for share_override_tickets
11/01/2012 07:26:48|schedu|srvname|I|using 1 for share_functional_shares
11/01/2012 07:26:48|schedu|srvname|I|using 200 for max_functional_jobs_to_schedule
11/01/2012 07:26:48|schedu|srvname|I|using 1 for report_pjob_tickets
11/01/2012 07:26:48|schedu|srvname|I|using 50 for max_pending_tasks_per_job
11/01/2012 07:26:48|schedu|srvname|I|using 0.5 for weight_ticket
11/01/2012 07:26:48|schedu|srvname|I|using 0.075 for weight_waiting_time
11/01/2012 07:26:48|schedu|srvname|I|using 3.6e+06 for weight_deadline
11/01/2012 07:26:48|schedu|srvname|I|using 0.5 for weight_urgency
11/01/2012 07:26:48|schedu|srvname|I|using 1 for weight_priority
11/01/2012 07:26:48|schedu|srvname|I|using 100 for max_reservation
11/01/2012 07:26:48| main|srvname|I|scheduler has been started
11/01/2012 07:26:48| main|srvname|I|start of jvm thread is disabled in bootstrap file
11/01/2012 07:26:48| main|srvname|I|qmaster startup took 1 seconds
11/01/2012 07:27:12|worker|srvname|I|execd on cla-014.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-005.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-001.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-012.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-009.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-002.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-006.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-007.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-010.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-008.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-011.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-013.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-003.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-004.cluster registered

Then about one minute of silence, and this appears in the syslog:

Nov 1 07:28:06 srvname kernel: sge_qmaster[26300]: segfault at 0000000000000068 rip 000000000049e49c rsp 0000000046d52f00 error 6

Next, I'll try the GDB approach.

Wbr
Andreas
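
For reference, the kernel segfault line in the syslog can be decoded mechanically. The following is a minimal Python sketch, not part of SGE: the function name `decode_segfault` and the regular expression are illustrative assumptions, and the bit meanings are the standard x86 page-fault error-code bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode).

```python
import re

# Standard x86 page-fault error-code bits as reported by the Linux kernel.
PROTECTION = 0x1  # set: protection violation, clear: page not present
WRITE = 0x2       # set: write access, clear: read access
USER = 0x4        # set: fault happened in user mode

# Hypothetical pattern for the older "rip/rsp" message format seen in the log.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return the fields of a kernel segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    # The error code is parsed as hex; for values below 10 the base is irrelevant.
    err = int(m.group("err"), 16)
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "instruction_pointer": int(m.group("rip"), 16),
        "write_access": bool(err & WRITE),
        "user_mode": bool(err & USER),
        "page_present": bool(err & PROTECTION),
    }


if __name__ == "__main__":
    line = ("Nov 1 07:28:06 srvname kernel: sge_qmaster[26300]: segfault at "
            "0000000000000068 rip 000000000049e49c rsp 0000000046d52f00 error 6")
    for key, value in decode_segfault(line).items():
        print(key, hex(value) if key.endswith(("address", "pointer")) else value)
```

For the line above, error 6 decodes to a user-mode write to an unmapped page, and the fault address 0x68 is very small, which is consistent with (but does not prove) a write through a NULL pointer to a structure member at offset 0x68. The instruction pointer 0x49e49c is the value to look up in GDB or a symbol table once a binary with symbols is available.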
