Recovery? recent make world rendered system unusable (64 bit change)
I've been running 5.1-CURRENT for a while, and a couple of nights ago I did a make world. After a couple of hours of building, my system was unusable. Critical binaries like rm, ls, mtree, and sh failed, reporting "Exec format error". I can't log in, and I can no longer even boot single user.

I've hosed my system and am looking for a way to recover without having to reinstall everything and overwrite critical data and system config files.

Naturally, I only discovered the note in UPDATING after I trashed my system -- in fact, I read it from the OK boot prompt with its more. Doh!

    20031112:
        The statfs structure has been updated with 64-bit fields to
        allow accurate reporting of multi-terabyte filesystem sizes.
        You should build world, then build and boot the new kernel
        BEFORE doing a `installworld' as the new kernel will know about
        binaries using the old statfs structure, but an old kernel will
        not know about the new system calls that support the new statfs
        structure.
        [...]
        Running an old kernel after a `make world' will cause programs
        such as `df' that do a statfs system call to fail with a bad
        system call.
        [...]
        DO NOT make installworld after the buildworld w/o building and
        installing a new kernel FIRST. You will be unable to build a
        new kernel otherwise on a system with new binaries and an old
        kernel.

I'm looking for recommendations on how to recover, hopefully without trashing my critical system files like /etc/passwd. Ideally, I'd like a way to replace all the broken binaries and any related libraries without overwriting other files.

If I do a floppy-based install, select Custom/Expert, and then request a minimal install, I presume it will install a small set of binaries but also overwrite /etc/passwd, /etc/ssh/* and so on. Is there a way to have it update just the binaries and libraries?

If I have to, I could add another disk to this box. Then I could do a floppy install of 5.x onto that new disk, boot it, mount the old disk's partitions, and install the new install's binaries on the old partitions. Or perhaps I could do a make buildworld, kernel, installworld the proper way, using the old disk's partitions as the target.

Or could I -- somehow -- push a 64-bit-aware kernel onto this box so that the newly broken binaries will work again? How?

Again, I've got no shell access any more, so everything's gonna have to be done from floppy, or maybe CD if I can borrow a burner. Naturally, this is my net boot server for my diskless clients, so I can't go that route either. :-(

Any other suggestions? Thanks.
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Recovery? recent make world rendered system unusable (64 bit change)
masta [EMAIL PROTECTED] writes:

> The easy way is to grab a recent livecd from the jp snapshot service.
> [ http://livecd.sourceforge.net/ ] With the jpsnap livecd I was able to
> boot, copy all the working binaries from the cdrom over the corrupt
> binaries on the local HDD. I suggest you try the same idea.

That seems like a nice suite, but the site says it acts like a 4.6 repair, so I don't think the binaries would be suitable for replacing my damaged 5.1 commands. :-(
Re: Recovery? recent make world rendered system unusable (64 bit change)
Barney Wolff [EMAIL PROTECTED] writes:

> Re-install/upgrade from a cd. Upgrade should leave your files alone.

Thanks, Barney -- that's what I did and it saved my butt.

A few folks suggested either LiveCD images or fixit functionality. I was kinda dead in the water and didn't think I could download a LiveCD and burn it from another system. I played with the floppy fixit functionality a bit but didn't see a way to preserve /etc and such.

So I used a 5.1-RELEASE CD I had and chose the UPGRADE option, which promised to save my /etc stuff. I specified my old mount points (fortunately, I was able to read /etc/fstab from the boot OK prompt and make paper notes!). I then tried -- twice -- to install the minimal system from the CD, and both times the kernel panicked with a page fault (in process bufdaemon, the last time).

For grins, I again specified my mounts (only /, /var, /tmp, /usr; I didn't bother with /home and /usr/local), and told it to install via FTP. Surprisingly, this worked -- no panic. It appears to have installed a working kernel, /bin, /usr/bin, and friends, and now I'm running again.

I'm now doing a make buildworld, then will do a make kernel KERNCONF=MyKernelDefinitionFileName, then finally a make installworld per the UPDATING notes.

I've been using FreeBSD heavily since 2.2.x but had never used the Upgrade option before. It's a good thing.

Many thanks to everyone who replied. I promise I'll scan UPDATING before doing a make *world next time!
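For anyone reading this thread later, the order that the UPDATING entry asks for can be sketched as below. This is only a dry-run helper that prints the steps rather than running them; the function name upgrade_plan and the kernel config name MyKernel are placeholders I made up, and the exact targets you need may differ on your branch, so check UPDATING first.

```shell
#!/bin/sh
# Print (do not run) the source-upgrade steps in the order UPDATING asks
# for: new kernel installed and booted BEFORE installworld.
# "upgrade_plan" and "MyKernel" are illustrative names, not from the thread.
upgrade_plan() {
    kernconf=${1:-GENERIC}
    cat <<EOF
cd /usr/src
make buildworld
make buildkernel KERNCONF=$kernconf
make installkernel KERNCONF=$kernconf
# reboot onto the new kernel here, before touching userland
make installworld
mergemaster
EOF
}

upgrade_plan MyKernel
```

Printing the plan first makes it easy to compare against the current UPDATING entry before committing to an installworld.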
How to create distribution for later NFS sysinstall on other box?
Some of my systems are 5.1-CURRENT but I still have some older 4.x boxes. I'd like to upgrade them to the same OS as my 5.1 boxes. It seems stupid to feed them boot floppies and then FTP the OS across the WAN from freebsd.org or mirrors.

I expect there's a way to build a distribution on my main 5.1 system and then use sysinstall on the target 4.x box to install via NFS (or FTP or...) over the LAN. I have not found any pointers on doing this in the Handbook or in a couple of quick Google searches (perhaps I'm searching on the wrong terms).

Seems it should be something like this on the server:

    cd /usr/src
    make distribution

I'd like to make the distribution based on my 5.1-CURRENT, rather than copying/creating a 5.1-RELEASE image, so I won't have to do a subsequent update to get it CURRENT.

Any pointers? If I'm missing obvious docs, just tell me where to RTFM. :-)

Thanks.
make buildworld: Signal 11; Illegal instruction
I'm trying to do a make buildworld on my system:

    FreeBSD PECTOPAH.shenton.org 5.1-CURRENT FreeBSD 5.1-CURRENT #2: Tue Jul 1 19:48:37 EDT 2003 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/PECTOPAH i386

And it keeps dying at various points early in the build. It's a different location each time, sometimes as soon as 12 seconds in, sometimes as long as 100 seconds. Most of the time it's a Signal 11, e.g.:

    rm -f .depend GPATH GRTAGS GSYMS GTAGS
    === games/pom
    *** Signal 11

But sometimes it complains about Illegal instruction:

    === rescue/rescue/client
    rm -f dhclient clparse.o dhclient.o dhclient.conf.5.gz dhclient.leases.5.gz dhclient.8.gz dhclient-sc\
    ript.8.gz dhclient.conf.5.cat.gz dhclient.leases.5.cat.gz dhclient.8.cat.gz dhclient-script.8.cat.gz
    Illegal instruction (core dumped)
    *** Error code 132

This smells like a hardware problem to me. Oddly, this is the first off-the-shelf box I've bought in years: a Dell 600sc with CERC RAID controller and 256MB DELL RAM. To this, I added 512MB Crucial RAM.

I've seen this before in heavy builds (mozilla, openoffice, x11) but now it's really buggin' me. I'm kinda stuck if I can't make world.

Suggestions? If you think it's marginal HW, do you have any suggestions on how to test and determine the culprit?

Thanks.
Re: make buildworld: Signal 11; Illegal instruction
Chris Shenton [EMAIL PROTECTED] writes:

> *** Signal 11
> ...
> Illegal instruction (core dumped)
> *** Error code 132

Also seeing "*** Signal 4", if it matters. This sounds way too flaky to be SW.
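A note on reading those numbers: shells commonly report "killed by signal N" as exit status 128 + N, so the "Error code 132" above lines up with the "Signal 4" (illegal instruction) reports. A small sketch of that arithmetic follows; the function name describe_status is made up for illustration, and 139 is included only as the status that would correspond to signal 11 under the same convention.

```shell
#!/bin/sh
# Map an exit status back to a signal number, using the common
# convention that "killed by signal N" is reported as 128 + N.
# "describe_status" is an illustrative name, not an existing tool.
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "signal $((status - 128))"
    else
        echo "exit $status"
    fi
}

describe_status 132   # the "Error code 132" from the build log
describe_status 139   # what a signal 11 death would look like
```

Seeing both signal 4 and signal 11 at random places in the build points to the same conclusion the message draws: this does not look like a software problem.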
Re: 5.1-CURRENT hangs on disk i/o? sysctl_old_user() non-sleepablelocks
Don Lewis [EMAIL PROTECTED] writes:

> Try the very untested patch below ...

Well, it seems to be working now, but not necessarily due to this patch.

I lost two of the four drives on my ATA RAID card (RAID-5), so I lost my entire system :-(. I rebuilt the box from the 5.0-RELEASE floppies/net, then cvsupped to 5.1-CURRENT and reinstalled all the stuff like qmail and apache. I'm no longer seeing the unlocked messages in the logs.

Thanks for all your help!
Re: 5.1-CURRENT hangs on disk i/o? sysctl_old_user() non-sleepablelocks
Don Lewis [EMAIL PROTECTED] writes:

> Try the very untested patch below ...
> RCS file: /home/ncvs/src/sys/kern/uipc_syscalls.c,v

When I apply the patch, how much of the OS do I need to rebuild? Just do a make install in the .../src/sys/kern dir? Rebuild the OS from the top dir? Rebuild the kernel? I want to make sure I'm giving this a proper test.

Thanks.
Re: 5.1-CURRENT hangs on disk i/o? sysctl_old_user() non-sleepablelocks
Don Lewis [EMAIL PROTECTED] writes:

> Try the very untested patch below ...
> RCS file: /home/ncvs/src/sys/kern/uipc_syscalls.c,v
> retrieving revision 1.150
> Try the very untested patch below ...
> diff -u -r1.150 uipc_syscalls.c
> --- uipc_syscalls.c 12 Jun 2003 05:52:09 - 1.150
> +++ uipc_syscalls.c 18 Jun 2003 03:14:42 -
> @@ -1775,10 +1775,13 @@
>           */
>          if ((error = fgetvp_read(td, uap->fd, &vp)) != 0)
>                  goto done;
> +        vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td);
>          if (vp->v_type != VREG || VOP_GETVOBJECT(vp, &obj) != 0) {
>                  error = EINVAL;
> +                VOP_UNLOCK(vp, 0, td);
>                  goto done;
>          }
> +        VOP_UNLOCK(vp, 0, td);

Tried it, rebuilt the kernel, rebooted: no effect :-(

You were correct about apache using it. Doing a simple fetch http://pectopah/ causes the error, dropping me into ddb if panic is enabled. A tr shows the same trace as I submitted yesterday :-(

Time to find that null modem cable.

Thanks.
Re: qmail uses 100% cpu after FreeBSD-5.0 to 5.1 upgrade
Don Lewis [EMAIL PROTECTED] writes:

> Thanks for doing the testing. I just committed this patch.

Seems fine here too -- many thanks.
Re: 5.1-CURRENT hangs on disk i/o? sysctl_old_user() non-sleepablelocks
Don Lewis [EMAIL PROTECTED] writes:

> I doubt it. I checked in a fix for this problem today so you should
> get the fix when you next cvsup.

Yup, many thanks.

> Can you break into ddb and do a ps to find out what state all the
> processes are in?

I'm a newbie to ddb. I was able to get a ps from a hung system but didn't know how to capture it to send to you. Any hints?

> You might want to try adding the DEBUG_VFS_LOCKS options to your
> kernel config to see if that turns up anything.

Oh, man, I'm getting killed here now. I rebuilt the kernel with that option (not found in GENERIC or other examples in /usr/src/sys/i386/conf/). Now the system is dropping into ddb every minute or so with complaints like the following on the screen and in /var/log/messages:

    Jun 17 21:06:08 PECTOPAH kernel: VOP_GETVOBJECT: 0xc584eb68 is not locked but should be
    Jun 17 21:08:04 PECTOPAH last message repeated 3 times
    ...
    Jun 17 21:18:55 PECTOPAH kernel: VOP_GETVOBJECT: 0xc59346d8 is not locked but should be
    Jun 17 21:18:59 PECTOPAH last message repeated 5 times

Lots 'n' lots of 'em, with a few for the same hex value, then another set for a different hex value.

> There is also ddb command to list the locked vnodes show lockedvnods.

After I type cont at ddb a few times the system runs for a while again, only to repeat. When it drops to ddb again, that show command doesn't list anything.

I may have to remove that option from my kernel just to get the system to run for a bit, even though eventually it will hang. It's (of course) my main box, which the other systems NFS off, the mail server, etc. :-(

> Are you using nullfs or unionfs which are a bit fragile?

Nope. I'd be happy to mail you my kernel config if you want. I've posted it to http://chris.shenton.org/PECTOPAH but if the system's hung again, naturally it won't be available :-(

Thanks for your help. Any other things I might try?

Dunno if this matters, but I'm using a DELL CERC ATA RAID card with disks showing up as amrd*. It was flawless at 5.0-{CURRENT,RELEASE}.
Re: 5.1-CURRENT hangs on disk i/o? sysctl_old_user() non-sleepablelocks
Oh, FWIW, I did a cvsup and rebuilt the OS and kernel then did a mergemaster about 30 minutes ago in order to get your fix to my qmail issue. So I'm running about as CURRENT as possible.
Re: 5.1-CURRENT hangs on disk i/o? sysctl_old_user() non-sleepablelocks
Don Lewis [EMAIL PROTECTED] writes:

> If you have another machine and a null modem cable you can redirect
> the system console of the machine to be debugged to a serial port and
> run some comm software on the other machine so that you can capture
> all the output from ddb.

OK, I'll give that a shot, probably tomorrow.

> At the ddb prompt, you can do a tr command to get a stack trace,
> which is likely to be very helpful in pointing out the offending code.

Just saw it again, did a tr. From chicken-scratch notes, the last bits are:

    VOP_GETVOBJECT(...)
    do_sendfile(...)
    sendfile(...)
    syscall(...)
    Xint0x80_syscall...
    --- syscall( 393, FreeBSD ELF32, sendfile) ...

The next time it dropped into ddb, it was the same sendfile thing.

The main services I'm running are qmail, apache, and NFS. Also tftp, rarpd, lpd, sshd, bootparamd ... oh, well, I guess I'm running a bunch of stuff here. :-( Not sure which one, if any, this would be. Unless sendfile() is something in the OS?

I'll have to dig up a null modem cable and grab console output. I realise I'm not giving enough detailed info to be very helpful here.

> If you are running the NFS *client* code on this machine, there is
> one lock assertion that is easy to trigger.

In my kernel config I have this, because a diskless box uses the same kernel, but my /etc/fstab doesn't mount anyone else's NFS exports.

    options NFSCLIENT #Network Filesystem Client

    [EMAIL PROTECTED]101 ps -axww|grep nfs
       42 ?? IL 0:00.00 (nfsiod 0)
       43 ?? IL 0:00.00 (nfsiod 1)
       44 ?? IL 0:00.00 (nfsiod 2)
       45 ?? IL 0:00.00 (nfsiod 3)
      428 ?? Is 0:00.03 nfsd: master (nfsd)
      429 ?? I  0:00.09 nfsd: server (nfsd)
      430 ?? I  0:00.00 nfsd: server (nfsd)
      431 ?? I  0:00.00 nfsd: server (nfsd)
      432 ?? I  0:00.00 nfsd: server (nfsd)
    35366 p0 R+ 0:00.00 grep nfs

> At the ddb prompt you should be able to use the write command tweak
> a couple of variables to modify this behavior. If you set the
> vfs_badlock_panic variable to zero, the kernel will no longer drop
> into DDB when one of these lock violations occurs. If you set the
> vfs_badlock_print variable to zero, the kernel will stop printing the
> warnings.

OK, I've done an examine vfs_badlock_panic, which shows it zero, then write vfs_badlock_panic 0, at least for now.

Thanks again.
5.1-CURRENT hangs on disk i/o? sysctl_old_user() non-sleepable locks
(I don't know if this has any relation to the problems I reported yesterday with qmail-send consuming 100% cpu after the 5.0 to 5.1 upgrade.)

After booting 5.1-CURRENT the system runs fine for a while. Then later most disk i/o related actions seem to hang. E.g., the system works, but when cron kicks off a glimpseindex in the middle of the night, the system is useless by the morning.

If I log in on the console as me, it takes my username and password then hangs (trying to run /usr/local/bin/bash?). If I do this as root, I do get a shell (/bin/csh). After a point, asking for top will hang, even as root. Even a reboot hung this morning with nothing in the logs.

The system has become almost unusable because of this, requiring frequent reboots or hardware resets.

Sometimes when I do something as simple as ps I see this ominous message on the console:

    sysctl_old_user() with the following non-sleepablelocks held:
    exclusive sleep mutex process lock r = 0 (0xc50bc9e0) locked @ /usr/src/sys/kern/kern_proc.c:258

which gets into /var/log/messages as:

    Jun 16 08:33:48 PECTOPAH kernel: exclusive sleep mutex process lock r = 0 (0xc50c7618) locked @ /usr/src/sys/kern/kern_proc.c:258

There are a bunch of these. That file is version:

    $FreeBSD: src/sys/kern/kern_proc.c,v 1.189 2003/06/14 06:20:25 alc Exp $

and the line is the PROC_LOCK() portion of:

    struct proc *
    pfind(pid)
            register pid_t pid;
    {
            register struct proc *p;

            sx_slock(&allproc_lock);
            LIST_FOREACH(p, PIDHASH(pid), p_hash)
                    if (p->p_pid == pid) {
                            PROC_LOCK(p);
                            break;
                    }
            sx_sunlock(&allproc_lock);
            return (p);
    }

Any thoughts? Thanks.
qmail uses 100% cpu after FreeBSD-5.0 to 5.1 upgrade
I've been running qmail for years and like it, installed pretty much per www.LifeWithQmail.org. My main system was running FreeBSD 5.0-RELEASE and -CURRENT and qmail was fine. When I upgraded to 5.1-CURRENT a couple of days back, the qmail-send process started using all CPU.

    last pid: 22793;  load averages:  1.06,  1.02,  1.00    up 0+08:13:46  20:36:32
    74 processes:  2 running, 72 sleeping
    Mem: 38M Active, 51M Inact, 84M Wired, 28K Cache, 73M Buf, 452M Free
    Swap: 2048M Total, 2048M Free

      PID USERNAME PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU COMMAND
      615 qmails   132    0  1228K   616K RUN    483:00 96.88% 96.88% qmail-send

I noticed an identical complaint on the qmail list, to which there have so far been no replies (except "you should ask the FreeBSD list"):

> From: Luca Morettoni [EMAIL PROTECTED]
> Subject: qmail on FreeBSD 5.1-CURRENT
> To: [EMAIL PROTECTED]
> [...]
> qmail is run under daemontools and all work fine (the configuration
> is 2 years old!), but when I delivery the first mail (localy or
> remote) the qmail-send process fire up to 100% of CPU infinitely
> All other mail are right delivery, and the CPU use is the only
> problem, I see in qmail-send.c that select() function, after the
> first message, allways return 1

A truss shows me it's running in a tight loop over this code:

    open("lock/trigger",0x4,027757775230)            = 8 (0x8)
    stat("todo",0xbfbffa00)                          = 0 (0x0)
    open("todo",0x4,01)                              = 9 (0x9)
    fstat(9,0xbfbffa00)                              = 0 (0x0)
    fcntl(0x9,0x2,0x1)                               = 0 (0x0)
    fstatfs(0x9,0xbfbff900)                          = 0 (0x0)
    getdirentries(0x9,0x8059000,0x1000,0x805a214)    = 512 (0x200)
    gettimeofday(0xbfbffbc8,0x0)                     = 0 (0x0)
    select(0x9,0xbfbffcbc,0xbfbffc3c,0x0,0xbfbffc24) = 1 (0x1)
    gettimeofday(0xbfbffbc8,0x0)                     = 0 (0x0)
    gettimeofday(0xbfbffbc8,0x0)                     = 0 (0x0)
    select(0x9,0xbfbffcbc,0xbfbffc3c,0x0,0xbfbffc24) = 1 (0x1)
    gettimeofday(0xbfbffbc8,0x0)                     = 0 (0x0)
    getdirentries(0x9,0x8059000,0x1000,0x805a214)    = 0 (0x0)
    lseek(9,0x0,0)                                   = 0 (0x0)
    close(9)                                         = 0 (0x0)
    gettimeofday(0xbfbffbc8,0x0)                     = 0 (0x0)
    select(0x9,0xbfbffcbc,0xbfbffc3c,0x0,0xbfbffc24) = 1 (0x1)
    gettimeofday(0xbfbffbc8,0x0)                     = 0 (0x0)
    close(8)                                         = 0 (0x0)
    open("lock/trigger",0x4,027757775230)            = 8 (0x8)

I see nothing besides the usual message delivery information in qmail's logs.

Failing that, I rebuilt qmail and it seemed to have fixed it, but I didn't wait long enough: it's pegged at 100% CPU, constantly. If what Luca says is true, maybe it hadn't sent a message yet.

Anyone else seen this or know what in FreeBSD-5.1 might have changed to cause this? Any thoughts on how I might go about diagnosing this any better?

Thanks.
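One way to make a tight loop like the one in the truss output easier to see is to count how often each system call shows up in a captured trace. The following is a rough sketch only; the function name syscall_histogram is made up, and it assumes each trace line starts with the call name followed by an opening parenthesis, as in the excerpt above.

```shell
#!/bin/sh
# Count how often each system call name appears in truss-style output
# read from stdin. Lines are assumed to look like:
#   select(0x9,...) = 1 (0x1)
# "syscall_histogram" is an illustrative name, not an existing tool.
syscall_histogram() {
    awk -F'[(]' 'NF > 1 { count[$1]++ }
                 END { for (name in count) print name, count[name] }' | sort
}

# Small demonstration using three lines in the same shape as the trace.
printf '%s\n' \
    'gettimeofday(0xbfbffbc8,0x0) = 0 (0x0)' \
    'select(0x9,0xbfbffcbc,0xbfbffc3c,0x0,0xbfbffc24) = 1 (0x1)' \
    'gettimeofday(0xbfbffbc8,0x0) = 0 (0x0)' | syscall_histogram
```

With a real capture, something like truss -o /tmp/qmail.truss -p 615 followed by syscall_histogram < /tmp/qmail.truss would do (the output path is a placeholder; 615 is the qmail-send PID from the top output). A trace dominated by select and gettimeofday would be consistent with Luca's observation that select() keeps returning 1.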
DELL CERC amr RAID card beeping, dead drive? how to diagnose/fix?
I have a DELL 600SC which came with a DELL CERC RAID controller. It's recognized by FreeBSD-CURRENT as an amr device even though it's got four ATA disk channels on it instead of the documented SCSI drives for the PERC controller. I have 4x WD1200JB ATA 120GB disks on it which have been running fine for a few months as a set of RAID-5 volumes. From dmesg:

    amrd0: LSILogic MegaRAID logical drive on amr0
    amrd0: MB (20477952 sectors) RAID 5 (optimal)
    amrd1: LSILogic MegaRAID logical drive on amr0
    amrd1: 111093MB (227518464 sectors) RAID 5 (optimal)
    amrd2: LSILogic MegaRAID logical drive on amr0
    amrd2: 111093MB (227518464 sectors) RAID 5 (optimal)
    amrd3: LSILogic MegaRAID logical drive on amr0
    amrd3: 111099MB (227530752 sectors) RAID 5 (optimal)

An hour ago, it started beeping at me. I suspect this is the CERC card warning me that one of the disk drives has failed and that I'd better do something about it. :-(

Is there a way to diagnose it from a live system, to query which of the four ATA drives it thinks is dead, so I can replace it?

(Seems to me that a WD1200JB drive should last a lot longer than the few months it's been running in a properly ventilated DELL box; any ideas?)

Anyone have experience with this CERC controller and replacing a drive? My biggest fear is that I haven't tested the RAID rebuild, and that even when I do replace the failed (?) drive it won't do the automatic rebuild and save my data.

Other suggestions? Thanks.
Re: A few 5.0-Release questions...
John Wilson [EMAIL PROTECTED] writes:

> --- Scott Long [EMAIL PROTECTED] wrote:
> > [Dell PowerEdge]
> > What model? There are quite a few PowerEdges out
>
> It's a 600SC - P4 1.8 - Perc3/SC

FWIW, I had absolutely no trouble booting and installing 5.0-R on my 600SC, with the DELL-supplied CERC RAID card (the amr device recognized it, but it drives 4x ATA disks rather than SCSI) and an Intel gigabit ether card. Got X11 working on it rather easily too.

I don't have any other drives (than the supplied IDE CD) in the box.

To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: Diskless: 5.0R scripts, boot, NFS mount problems I didn't have in 4.7S
Matthew Dillon [EMAIL PROTECTED] writes:

> # ls -lR /conf
> drwxr-xr-x 5 root wheel 512 Dec 21 10:37 base
> drwxr-xr-x 3 root wheel 512 Dec 19 21:56 default
> ...
> /conf/base/etc:
> -rw-r--r-- 1 root wheel 18 Dec 19 22:10 diskless_remount
> -rw-r--r-- 1 root wheel 6 Dec 19 22:22 md_size
> ...
> /conf/default/etc:
> -rw-r--r-- 1 root wheel 184 Feb 18 18:16 fstab
> -rw-r--r-- 1 root wheel 867 Dec 21 00:04 rc.conf
> -rw-r--r-- 1 root wheel 197 Feb 18 18:19 rc.local

I fiddled standard-supfile to get CURRENT (rather than RELENG_5_0) and am now able to boot with a config like you describe. Thanks!

You appear to be doing as diskless(8) suggests: mount the server's / and therefore get its /etc and boot its kernel (no need to populate a different directory with clone_root). But that kernel must have option BOOTP according to the manpage. If I recompile my server's kernel with this, the diskless client boots, but the server will no longer boot because it hangs sending out bootp requests which no one answers.

Seems like diskless clients would have to have separate kernels with the option BOOTP, while any servers must omit this option. How do you keep them separate? Or am I missing something fundamental?

Thanks.

PS: could you show me your dhcpd.conf so I can see how you're specifying your root filesystem? Mine's currently:

    option root-path "192.168.255.185:/";
Re: Diskless: 5.0R scripts, boot, NFS mount problems I didn't have in 4.7S
Matthew Dillon [EMAIL PROTECTED] writes:

> If you do this pxeboot will attempt to load the kernel via TFTP
> instead of via NFS. You then put your kernel in /tftpboot right
> alongside a copy of pxeboot. This allows you to netboot a different
> kernel than the one in the server's root directory.

Ah... [sound of lightbulb going on] I was wondering why it would be useful to get the kernel via TFTP rather than the NFS mount. Makes sense.

Thanks!
Re: Diskless: 5.0R scripts, boot, NFS mount problems I didn't have in 4.7S
Matthew Dillon [EMAIL PROTECTED] writes:

> 4.x and -current use the same mechanism, except 4.x uses MFS and
> -current uses MD.

4.x uses /etc/diskless[12] while 5.x (by default) uses /etc/rc.d/(init)?diskless. The latter works very differently from the former.

> Ignore the handbook. Try 'man diskless'.

Ouch, will try the man.

> kenv is only used in current's rc.diskless scripts, and it resides
> in /bin on -current.

Not on mine:

    chris@Pectopah103 whereis kenv
    kenv: /usr/bin/kenv /usr/share/man/man1/kenv.1.gz /usr/src/bin/kenv
    chris@Pectopah104 ls /bin/kenv
    ls: /bin/kenv: No such file or directory
    chris@Pectopah105 uname -a
    FreeBSD Pectopah.shenton.org 5.0-RELEASE-p1 FreeBSD 5.0-RELEASE-p1 #0: Sun Feb 16 16:10:36 EST 2003 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/Pectopah i386

And /usr/bin/kenv needs the elf libraries to run; it's not static, so it shouldn't live in /sbin and can't run on a diskless box until /usr is mounted.

> Basically what you do is create files and directories in
> /conf/base and /conf/default which are used to populate the MFS/MD
> root and other directories. I have included my setup at the end.

Which startup scripts are you running, old diskless[12] or new rc.d/(init)?diskless?

Thanks for your examples, I'll plow through them tonight. But -- more below -- these sure look like 4.x-compatible stuff, not 5.0.

> /conf/base:
> total 5
> drwxr-xr-x 2 root wheel 512 Dec 21 10:37 dev
> drwxr-xr-x 2 root wheel 512 Dec 19 22:22 etc
> -rw-r--r-- 1 root wheel 11 Dec 20 15:38 etc.remove
> drwxr-xr-x 2 root wheel 512 Dec 20 14:31 root
> -rw-r--r-- 1 root wheel 12 Dec 20 15:38 root.remove
>
> /conf/base/dev:
> total 2
> -rw-r--r-- 1 root wheel 18 Dec 21 10:37 diskless_remount
> -rw-r--r-- 1 root wheel 6 Dec 19 22:22 md_size

The etc.remove and md_size files are used by 4.x's diskless[12] but NOT by the 5.x /etc/rc.d/(init?)diskless scripts. Are you using the old startup rc stuff, possibly changing the default value in /etc/defaults/rc.conf:

    rc_ng=YES # Set to NO to disable new-style rc scripts.

If so and it works for you, I can certainly do the same. But I'd still like to figure out how to get the 5.x rc.d/* scripts to do their thing.

Actually, I don't see any code to look for that md_size or diskless_remount in either of 5.0's rc.diskless[12] or rc.d/(init)?diskless. I do know that what you're describing is in 4.x's rc.diskless[12], and I did have that working on a 4.7S system. That's why I'm having so much trouble with the 5.0 diskless boot -- everything's changed.

Lemme know if I'm way off base, but it sounds like you're describing a 4.x diskless boot and my problem's with 5.0.

Thanks a bunch!
Re: Diskless: 5.0R scripts, boot, NFS mount problems I didn't have in 4.7S
Matthew Dillon [EMAIL PROTECTED] writes:

> Make sure your NFS server is exporting to your subnet and that it is
> running the necessary services, (portmap, mountd, nfsd -t -u -n 4).

My boot server is 5.0, so that's the kernel my diskless box gets. My other boxes are 4.x boxes but I'll try.

4.x's portmap is now 5.x's rpcbind; other processes seem fine too:

     41 ?? IL 0:00.00 (nfsiod 0)
     42 ?? IL 0:00.00 (nfsiod 1)
     43 ?? IL 0:00.00 (nfsiod 2)
     44 ?? IL 0:00.00 (nfsiod 3)
    276 ?? Ss 0:00.08 /usr/sbin/rpcbind
    339 ?? Is 0:00.08 /usr/sbin/mountd -r
    345 ?? Is 0:00.02 nfsd: master (nfsd)
    347 ?? I  2:03.22 nfsd: server (nfsd)
    348 ?? I  0:12.20 nfsd: server (nfsd)
    349 ?? I  0:06.56 nfsd: server (nfsd)
    350 ?? I  0:02.03 nfsd: server (nfsd)

> If you have another box that you can boot normally (not netboot),
> test the NFS server from that box by mounting / and /usr:
>
> other# mount 192.168.255.185:/usr /mnt

I believe I tried mounting a 4.x volume onto the diskless 5.0 box and it failed in the same way. I didn't take careful notes so I'll repeat.

I can mount the 5.0 boot server's /usr onto a 4.7S client with no problem:

    thanatos(4.7S)# mount 192.168.255.185:/usr /mnt
    thanatos(4.7S)# mount
    /dev/da0s1a on / (ufs, local)
    /dev/da0s1e on /tmp (ufs, local)
    /dev/da0s1g on /usr (ufs, NFS exported, local)
    /dev/da0s1d on /usr/local (ufs, NFS exported, local)
    /dev/da0s1f on /var (ufs, local)
    procfs on /proc (procfs, local)
    linprocfs on /usr/compat/linux/proc (linprocfs, local)
    /dev/da0s1h on /home.THANATOS (ufs, local)
    pectopah:/home on /home (nfs)
    pectopah:/usr/local on /usr/localnew (nfs)
    pectopah:/usr/X11R6 on /usr/local/X11R6 (nfs)
    192.168.255.185:/usr on /mnt (nfs)

The name pectopah is the addr 192.168.255.185 and is the 5.0 NFS server.

So it seems something is broken on my 5.0 NFS client's side. But I can mount a 4.7S-exported filesystem onto my 5.0 boot server, so at least its mount_nfs is OK:

    /sbin/mount_nfs 192.168.255.180:/usr /mnt

> It is also possible that someone has broken something in NFS
> recently. The -current I am running (which works fine as a server
> for my EPIA 5000 and EPIA M 9000) is several weeks old.

Hmmm, how could I check this out? I'm happy to do testing and provide feedback.

> If your /usr partition is on / on your server (i.e. not its own
> partition), then remember to use the -alldirs option in /etc/exports
> for / and /usr. If /usr is on its own partition you don't need
> -alldirs unless you are trying to mount a subdirectory in / or /usr.
> You *might* need -alldirs on your / export. In any case, I always set
> -alldirs on all my read-only exports and that is what I would
> recommend you do too.

I've removed the readonly flags until I get this working. I have separate / and /usr partitions; here's my 5.0 boot server's /etc/exports file (Kitchen is the diskless box :-)

    /usr/local -alldirs -maproot=root Sisyphus Thanatos Beatnik Kitchen
    /usr       -alldirs -maproot=root Sisyphus Thanatos Beatnik Kitchen
    /home      -maproot=root Sisyphus Thanatos Beatnik Kitchen

And the dhcpd.conf which told the diskless client where to get its / partition from (and that is successful):

    host Kitchen.shenton.org {
        hardware ethernet 00:40:63:c3:89:bb;
        fixed-address kitchen.shenton.org;
        filename "pxeboot";
        option root-path "192.168.255.185:/usr/local/diskless";
    }

Am I correct that I only need to have mount_nfs on the diskless client, and that I do NOT need an rpcbind running on the diskless client before issuing the mount? Since pxeboot (?) mounts / via NFS, I'm not understanding why mount_nfs can't.

Thanks again.
Re: Diskless: 5.0R scripts, boot, NFS mount problems I didn't have in 4.7S
Matthew Dillon [EMAIL PROTECTED] writes: Are you sure you have done a recent buildworld/installworld? It sounds like you haven't. In -current kenv is in /bin (i.e. the source is in /usr/src/bin/kenv on -current) as of the 15th of this month. Well, I'll be danged. I installed 5.0R on Saturday via FTP from ftp*.freebsd.org, did a cvsup ... /usr/share/examples/cvsup/standard-supfile then make world. But I see it in /usr/src/bin/kenv now, cvsupped last night. Rebuilding now. Then need to redo clone_root to populate my diskless root hierarchy. Thanks for the kick in the butt. You must be working off an out of date source tree. Weird, perhaps I fat fingered and cvsupped stable and built that -- installing onto a Current system. That would explain a lot of this ugliness. I have included -current's current /usr/src/etc/rc.d/initdiskless script Thank you. If your sources are out of date you should update them... As you can see, the initidiskless script is full of references to md_size :-) # Copyright (c) 1999 Matt Dillion Ah hah... :-) # $FreeBSD: src/etc/rc.d/initdiskless,v 1.23 2003/02/15 16:29:20 jhay Exp $ OK, this is weird; I cvsup daily but am two versions behind you: Pectopah# cvsup -l 1 -g -h cvsup2.freebsd.org /usr/share/examples/cvsup/standard-supfile Connected to cvsup2.freebsd.org Updating collection src-all/cvs Finished successfully Pectopah# grep '$FreeBSD' /usr/src/etc/rc.d/initdiskless # $FreeBSD: src/etc/rc.d/initdiskless,v 1.21 2002/10/12 10:31:31 schweikh Exp $ Being out of sync would explain a lot. Looks like the tag in standard-supfile points to the wrong thing, rather than the source for CURRENT: # $FreeBSD: src/share/examples/cvsup/standard-supfile,v 1.21.2.1 2003/01/16 05:59:14 scottl Exp $ # # This file contains all of the CVSup collections that make up the # FreeBSD-current source tree. ... *default release=cvs tag=RELENG_5_0 OK, I'll change this to tag=. and recvsup, try again. A big doh! Many thanks. 
To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
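The mismatch in the message above came down to two values: the tag= setting on the supfile's *default line and the revision in a file's $FreeBSD$ ident line. A minimal shell sketch for reading both could look like the following. The function names supfile_tag and freebsd_rev are hypothetical (they are not part of cvsup or FreeBSD), and the sed patterns assume the line layouts quoted in the message.

```shell
#!/bin/sh
# Hypothetical helpers; names and patterns are assumptions for illustration.

# Print the CVS tag from the first "*default ... tag=VALUE" line of a supfile.
# Commented-out lines (starting with '#') are ignored.
supfile_tag() {
    sed -n 's/^\*default[[:space:]].*tag=\([^[:space:]]*\).*$/\1/p' "$1" | head -n 1
}

# Print the revision number from the first $FreeBSD$ ident line of a file,
# e.g. "1.21" from "... initdiskless,v 1.21 2002/10/12 ...".
freebsd_rev() {
    sed -n 's/.*\$FreeBSD: [^ ]* \([0-9.]*\) .*/\1/p' "$1" | head -n 1
}
```

Running supfile_tag on the supfile before cvsupping, and freebsd_rev on a file such as /usr/src/etc/rc.d/initdiskless afterwards, gives a quick way to compare against the revision someone else reports.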
Diskless: 5.0R scripts, boot, NFS mount problems I didn't have in 4.7S
I was running a VIA Mini-ITX diskless box off a 4.7-STABLE box for a while, using a root fs created by the clone_root discussed in the Handbook, plus some tweaks. I'm having a heck of a time trying to get this running under 5.0-RELEASE, now sync'd to 5.0-CURRENT as of yesterday, then mergemastered.

If someone can provide some clues or pointers, I'd be happy to doc how I get it to work (for the Handbook?) and could take a stab at updating clone_root for 5.x if it's needed.

Background: Been using FreeBSD since 2.2.x. I can code. I can RTFM. :-) I've read the 5.0 Release Notes and Early Adopters docs. I've read Handbook section 19.6 Diskless Operation and it covers the DHCP, PXE, TFTP, and NFS OK but glosses over how the diskless box actually boots -- what scripts it runs and such. That's where I'm stuck.

How does a diskless box know to run a diskless boot script (rather than a standard one)? I'm assuming it invokes init, which runs /etc/rc, which then runs /etc/rc.d/* in the 5.0 model. Am I close? I've read Handbook 7.6 Init but it doesn't actually say much about how init hands off to the rc* scripts. The man page for rc(8) seems to document the 5.0 rc.d/* well, so I'll revisit my diskless boot process.

Here's what I have working so far:

* isc-dhcpd: offers hostname, IP, location of boot image, root filesystem location
* tftpboot: offers pxeboot
* pxeboot: gets and runs kernel, mounts root filesystem

Then it begins the init/rc startup process and eventually dies.

Here's what I've found broken or can't get past:

The clone_root script assumes it's copying all the files it needs, but (for example) mtree now lives in /usr/sbin instead of /sbin, so it's not copied to the diskless root area and isn't available.

kenv lives in /usr/bin, but /usr isn't mounted before kenv is used in rc.diskless1.

clone_root wants to run /dev/MAKEDEV but that file doesn't exist in my 5.0 /dev/; I see it in /usr/src/etc/ but it wasn't installed by make installworld or mergemaster. (Is this a glitch?)
The rc.diskless[12] scripts have changed significantly from 4.7 to 5.0. Are they even used with the new /etc/rc.d/* mechanism?

I've run clone_root, then manually installed a DISKLESS kernel file into the new location ($DISKLESSROOT)/boot/kernel/kernel. I've manually populated $DISKLESSROOT/conf/default/etc/ with an NFS-oriented fstab, rc.conf, rc.diskless*, and rc.d/[init]diskless and password-related files.

Upon boot, after the kernel has loaded, the console shows a bunch of rc.conf-style vars being set, then spews some debugging output which I put in $DISKLESSROOT/conf/default/etc/rc.d/diskless, so it's running that rather than the old /etc/rc.diskless* files.

I've moved the mount -a near the top of rc.d/diskless since it runs commands which are not available until /usr is mounted (e.g., mtree). The NFS mount fails with a message I don't understand:

    [udp] pectopah.shenton.org:/usr: RPCPROG_NFS: RPC: Unknown host

This occurs whether I specify a bare hostname, FQDN, or IP addr in fstab, even if I put host info in $DISKLESSROOT/conf/default/etc/. Is it really complaining about hosts, or is it an rpcbind thing? Note that it has already mounted the root filesystem a while back.

Since it can't mount /usr, everything else fails. Can someone point me in the right direction?

Thanks!
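The mtree and kenv problems described above are both cases of a program being needed before /usr is mounted but living only under /usr. A rough way to spot these ahead of a boot attempt is to check the cloned root directly. This is only a sketch: missing_from_root is a hypothetical helper, and it assumes that anything needed early should be executable in bin/ or sbin/ under the diskless root.

```shell
#!/bin/sh
# Hypothetical helper: print each named program that is not executable in
# either bin/ or sbin/ under the given root directory.
# Usage: missing_from_root ROOT PROG...
missing_from_root() {
    root=$1
    shift
    for prog in "$@"; do
        if [ ! -x "$root/bin/$prog" ] && [ ! -x "$root/sbin/$prog" ]; then
            printf '%s\n' "$prog"
        fi
    done
}
```

For example, missing_from_root "$DISKLESSROOT" mtree kenv would list whichever of the two has not made it into the cloned root. Which programs to pass in depends on what the rc scripts call before mounting /usr.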
Re: Diskless: 5.0R scripts, boot, NFS mount problems I didn't have in 4.7S
Chris Shenton [EMAIL PROTECTED] writes:

> I've moved the mount -a near the top of rc.d/diskless since it runs
> commands which are not available until /usr is mounted (e.g.,
> mtree). The NFS mount fails with a message I don't understand:
>
>     [udp] pectopah.shenton.org:/usr: RPCPROG_NFS: RPC: Unknown host

Tasteless self-followup: I get the same error when the boot process fails and drops me to a shell; I can get it with UDP or TCP mounts. For example:

    mount_nfs -U -2 192.168.255.185:/usr /mnt
    mount_nfs -U -3 192.168.255.185:/usr /mnt
    mount_nfs -T -2 192.168.255.185:/usr /mnt
    mount_nfs -T -3 192.168.255.185:/usr /mnt

The only difference is the [udp] vs [tcp] in the error msg:

    [tcp] 192.168.255.185:/usr: RPCPROG_NFS: RPC: Unknown host

I sniffed traffic with tcpdump and ethereal: the diskless client is contacting the server, so it's not having problems resolving that IP addr. I'm not hard-core enough to understand what might cause this failure, which occurs in /usr/src/sbin/mount_nfs/mount_nfs.c:

    if (portspec != NULL) {
            /* `ai' contains the complete nfsd sockaddr. */
            nfs_nb.buf = ai->ai_addr;
            nfs_nb.len = nfs_nb.maxlen = ai->ai_addrlen;
    } else {
            /* Ask the remote rpcbind. */
            nfs_nb.buf = &nfs_ss;
            nfs_nb.len = nfs_nb.maxlen = sizeof nfs_ss;
            if (!rpcb_getaddr(RPCPROG_NFS, nfsvers, nconf, &nfs_nb, hostp)) {
                    if (rpc_createerr.cf_stat == RPC_PROGVERSMISMATCH &&
                        trymntmode == ANY) {
                            trymntmode = V2;
                            goto tryagain;
                    }
                    snprintf(errbuf, sizeof errbuf, "[%s] %s:%s: %s",
                        netid, hostp, spec,
                        clnt_spcreateerror("RPCPROG_NFS"));
                    return (returncode(rpc_createerr.cf_stat,
                        &rpc_createerr.cf_error));
            }
    }

To see if this was a portmapper/rpcbind issue, I tried doing the client mount and specifying the port:

    mount -o port=2049 192.168.255.185:/usr /mnt

and got a slightly different error:

    [udp] 192.168.255.185:/usr: RPCMNT: clnt_create: RPC: Unknown host

and now I'm definitely over my head trying to read mount_nfs.c.
:-(

I don't understand this, since the client is able to mount /usr/local/diskless to get the root filesystem and run the kernel. But I believe pxeboot is doing this, not a full FreeBSD binary. What's the difference in the way they mount?

Any suggestions? Thanks.
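The four manual mount attempts in the message above cover every combination of transport (-U, -T) and NFS version (-2, -3). A small sketch that generates the same set of commands for any export and mount point, so the combinations can be retried after each change, might look like this. The function name nfs_mount_matrix is hypothetical; it only prints the commands and does not run them.

```shell
#!/bin/sh
# Hypothetical helper: print one mount_nfs command per transport/version
# combination, in the order used in the message (UDP first, then TCP).
# Usage: nfs_mount_matrix HOST:/EXPORT MOUNTPOINT
nfs_mount_matrix() {
    export_spec=$1
    mount_point=$2
    for transport in -U -T; do
        for vers in -2 -3; do
            printf 'mount_nfs %s %s %s %s\n' \
                "$transport" "$vers" "$export_spec" "$mount_point"
        done
    done
}
```

Piping the output of nfs_mount_matrix 192.168.255.185:/usr /mnt to sh on the client would run each attempt in turn; printing first makes it easy to review the commands before doing so.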