Re: bad sector in gmirror HDD
On Aug 20, 2011, at 06:24, Jeremy Chadwick wrote: You might also be wondering: that dd command writes 512 bytes of zero to that LBA; what about the old data that was there, in case the drive remaps the LBA?

If you write zeros at OS level to an LBA, you will end up with zeros at that LBA. What else did you expect??? LBAs that have already been remapped in ATA are no longer visible to the user/OS. You get a perfectly readable sector -- of course not at the original location, but as you confirmed, we are done with CHS addressing. The pending bad sectors are almost always 'corrected', that is, remapped, when you write to that LBA. So your script will find only one unreadable sector, and that will be the sector that is pending reallocation.

It may be that writing zeros to all free space, like

dd if=/dev/zero of=/filesystem/zero bs=1m; rm /filesystem/zero

is enough to remap the pending bad block and not leave any unreadable sectors. But if the unreadable sector is in a file or directory -- bad luck -- these will need to be rewritten.

Once upon a time, BSD/OS had a wonderful disk 'repair' utility. It could detect failing disks by reading every sector (it had a nice visual display), or it could re-write the drive by reading and writing back every sector. On bad blocks it would retry lots of times and eventually average what was read (with error). Having said that, I doubt modern ATA drives will let anything be read from a pending bad block, but... who knows.

Daniel
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
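Daniel's point that a zeroed LBA simply reads back as zeros (and that any remapping is invisible to the OS) can be demonstrated without risking a real disk. A toy sketch using a scratch file in place of /dev/ad2 -- all file names here are made up:

```shell
# Toy demonstration of "write zeros, read back zeros" on a scratch file
# standing in for the disk (on the real drive you would target /dev/ad2):
img=$(mktemp); zeros=$(mktemp)
dd if=/dev/urandom of="$img" bs=512 count=8 2>/dev/null   # fake 8-sector drive
dd if=/dev/zero of="$zeros" bs=512 count=1 2>/dev/null    # reference all-zero sector
# Rewrite "LBA" 3 with zeros, as the thread suggests doing to a pending sector
# (seek=/skip= are the POSIX spellings of FreeBSD dd's oseek=/iseek=):
dd if=/dev/zero of="$img" bs=512 seek=3 count=1 conv=notrunc 2>/dev/null
dd if="$img" of="$zeros.read" bs=512 skip=3 count=1 2>/dev/null
if cmp -s "$zeros" "$zeros.read"; then result=zeros; else result=differs; fi
echo "LBA 3 now reads back: $result"
rm -f "$img" "$zeros" "$zeros.read"
```

On a real drive the write is what triggers the firmware's reallocation decision; at the OS level all you ever observe is the zeros you wrote.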
Re: debugging frequent kernel panics on 8.2-RELEASE
on 18/08/2011 02:15 Steven Hartland said the following:

In a nutshell, the jail manager we're using will attempt to resurrect the jail from a dying state in a few specific scenarios. Here's an example:
1. jail restart requested
2. jail is stopped, so the java process is killed off, but active tcp sessions may prevent the timely full shutdown of the jail.
3. if an existing jail is detected, i.e. a dying jail from #2, instead of starting a new jail we attach to the old one and exec the new java process.
4. if an existing jail isn't detected, i.e. where there were no hanging tcp sessions and #2 cleanly shut down the jail, a new jail is created, attached to, and the java exec'ed.

The system uses static jailids, so it's possible to determine whether an existing jail for this service exists or not. This prevents duplicate services as well as making services easy to identify by their jailid. So what we could be seeing is a race between the jail shutdown and the attach of the new process?

Not a jail expert at all, but a few suggestions... First, wouldn't the 'persist' jail option simplify your life a little bit? Second, you may want to monitor the value of the prison0.pr_uref variable (e.g. via kgdb) while executing the various scenarios of what you do now. If after finishing a certain scenario you end up with a value lower than at the start of the scenario, then this is the troublesome one. Please note that prison0.pr_uref is composed of the number of non-jailed processes plus the number of top-level jails. So take this into account when comparing prison0.pr_uref values - it's better to record the initial value when no jails are started, and it's important to keep the number of non-jailed processes the same (or to account for its changes).

-- Andriy Gapon
Re: debugging frequent kernel panics on 8.2-RELEASE
on 20/08/2011 13:02 Andriy Gapon said the following: [full quote of the previous message snipped]
BTW, I suspect the following scenario, but I am not able to verify it either via testing or in the code:
- last process in a dying jail exits
- pr_uref of the jail reaches zero
- pr_uref of prison0 gets decremented
- you attach to the jail and resurrect it
- but pr_uref of prison0 stays decremented

Repeat this enough times and prison0.pr_uref reaches zero. To reach zero even sooner, just kill enough non-jailed processes.

-- Andriy Gapon
Remote installing
Hi, today I felt like living dangerously and want to upgrade a backup server from i386 to amd64, just to see if we can. Otherwise I'd scrap it and install from a USB stick. So I have my server running an amd64 GENERIC build, and I export /, /var, /usr on the server to be upgraded. But upgrading world does hit a snag early on:

empty changed flags expected schg found none not modified: Operation not supported

This is probably where some program wants to set the immutable flag on /var/empty... but it looks like NFS does not grok that. Now, I've seen plenty of suggestions to do it this way, but never saw anybody come back with this complaint, so I must be omitting something??

--WjW
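For reference, a hedged sketch of the cross-install the poster describes (the host name "target" and mount layout are assumptions; the make invocation follows the usual installworld-to-DESTDIR pattern):

```shell
# On the amd64 build host, with the i386 machine's filesystems NFS-mounted:
mount target:/    /mnt
mount target:/var /mnt/var
mount target:/usr /mnt/usr
cd /usr/src
make installworld DESTDIR=/mnt   # the step that trips over schg flags on NFS
```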
Re: Remote installing
On 2011-08-20 13:15, Willem Jan Withagen wrote: [description of the i386 to amd64 upgrade over NFS, and the schg 'Operation not supported' error, snipped]

I looked at the errors in more detail:
---
cd /mnt/; rm -f /mnt/sys; ln -s usr/src/sys sys
cd /mnt/usr/share/man/en.ISO8859-1; ln -sf ../man* .
ln: ./man1: Permission denied
ln: ./man1aout: Permission denied
ln: ./man2: Permission denied
ln: ./man3: Permission denied
ln: ./man4: Permission denied
ln: ./man5: Permission denied
ln: ./man6: Permission denied
ln: ./man7: Permission denied
ln: ./man8: Permission denied
ln: ./man9: Permission denied
---
which comes from the target distrib-dirs in etc. Why would an ln -sf like that fail? The filesystems are exported with -maproot=0.

--WjW
Re: debugging frequent kernel panics on 8.2-RELEASE
- Original Message - From: Andriy Gapon a...@freebsd.org

BTW, I suspect the following scenario, but I am not able to verify it either via testing or in the code:
- last process in a dying jail exits
- pr_uref of the jail reaches zero
- pr_uref of prison0 gets decremented
- you attach to the jail and resurrect it
- but pr_uref of prison0 stays decremented
Repeat this enough times and prison0.pr_uref reaches zero. To reach zero even sooner just kill enough non-jailed processes.

Ahh, now that explains all of our experienced panic scenarios:
1. A jail stop/start causing the panic, but only after at least a few days' worth of uptime. Here what we're seeing is enough leakage of pr_uref from the restarted jails to decrement prison0.pr_uref to 0, even with all the standard unjailed processes still running.
2. A machine reboot after all jails have been stopped, but after less uptime than #1. In this case we haven't seen enough leakage to decrement prison0.pr_uref to 0 given the number of prison0 processes, but it has been incorrectly decremented; so as soon as the reboot kicks in and prison0 processes start exiting, prison0.pr_uref gets further decremented and again hits 0 when it shouldn't.

Now if this is the case, we should be able to confirm it with a little more info:
1. What exactly does pr_uref represent?
2. Can its expected value be calculated by examining other details of the system, i.e. the number of running processes and the number of running jails?

If we can calculate the value that prison0.pr_uref should be, then by examining the machines we have which have been up for a while, we should be able to confirm whether an incorrect value is present on them and hence prove this is the case. Ideally a little script to run in kgdb would be the best way to test this.

Regards Steve

This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed.
In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmas...@multiplay.co.uk.
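Steve asks for a little kgdb script. A minimal sketch, assuming an 8.x system with the matching kernel symbols (the pr_uref and pr_ref members come from the struct prison discussed in the thread; paths are the usual defaults and may need adjusting):

```shell
# Live system; for a crash dump, point kgdb at the vmcore instead:
#   kgdb /boot/kernel/kernel /var/crash/vmcore.0
kgdb /boot/kernel/kernel /dev/mem <<'EOF'
print prison0.pr_uref
print prison0.pr_ref
quit
EOF
```

Recording prison0.pr_uref before and after each jail stop/start cycle, as Andriy suggests, should show whether the counter drifts downward.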
Re: USB/coredump hangs in 8 and 9
On Friday 19 August 2011 18:32:13 Andriy Gapon wrote: on 19/08/2011 00:24 Hans Petter Selasky said the following: On Thursday 18 August 2011 19:04:10 Andriy Gapon wrote:

If you can help Hans to figure out what is wrong with the USB subsystem in this respect, that would help us all.

Hi,

usb_busdma.c: /* we use mtx_owned() instead of this function */
usb_busdma.c: owned = mtx_owned(uptag->mtx);
usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
usb_hub.c: if (mtx_owned(&bus->bus_mtx)) {
usb_transfer.c: if (!mtx_owned(info->xfer_mtx)) {
usb_transfer.c: if (mtx_owned(xfer->xroot->xfer_mtx)) {
usb_transfer.c: while (mtx_owned(&xroot->udev->bus->bus_mtx)) {
usb_transfer.c: while (mtx_owned(xroot->xfer_mtx)) {

One fix you will need to make, if mtx_owned() is not giving the correct value, is:

First, could you please clarify what the correct, or rather expected, value is in this case. It's not immediately clear to me whether we should consider all locks as owned or un-owned in a situation where all locks are actually skipped behind the scenes. Maybe the USB code should explicitly check for that condition so as not to make any unsafe assumptions. Second, it's not clear to me what the above list actually represents in the context of this discussion.

Hi,

mtx_owned() is not only used to assert mutex ownership, but also to figure out which context the function is being called from. If the correct mutex is not locked already, we postpone the work until later. In the panic case there is no way to postpone work, so this check should be skipped in case of panic, because there is no other thread to hand the work to.

--HPS
Re: debugging frequent kernel panics on 8.2-RELEASE
- Original Message - From: Andriy Gapon a...@freebsd.org [suspected pr_uref leak scenario snipped]

I've just checked across a number of the panic dumps from the past few days and they all have prison0.pr_uref = 0, which confirms the cause of the panic. I've tried scripting continuous jail start/stops, but even after 1000's of iterations I have been unable to trigger this on my test machine, so I'm going to dig into the jail code to see if I can find out how it's incorrectly decrementing prison0, via inspection.

Regards Steve
Re: USB/coredump hangs in 8 and 9
on 20/08/2011 16:35 Hans Petter Selasky said the following: [grep list of mtx_owned() call sites and earlier discussion snipped]

mtx_owned() is not only used to assert mutex ownership, but also to figure out which context the function is being called from. If the correct mutex is not locked already, we postpone the work until later. In the panic case there is no way to postpone work, so this check should be skipped in case of panic, because there is no other thread to hand the work to.

Now I see, but I still cannot draw the conclusions... So what would you suggest - should the USB code explicitly check for panicstr (or SCHEDULER_STOPPED in the future)? Or what should mtx_owned() return - true or false?
-- Andriy Gapon
Re: debugging frequent kernel panics on 8.2-RELEASE
on 20/08/2011 18:51 Steven Hartland said the following: [confirmation that the panic dumps all show prison0.pr_uref = 0, and plans to inspect the jail code, snipped]

Steve, thanks for doing this! I'll reiterate my suspicion just in case - I think that you should look for the cases where you stop a jail, but then re-attach and resurrect the jail before it's completely dead.

-- Andriy Gapon
Re: USB/coredump hangs in 8 and 9
On Saturday 20 August 2011 18:45:57 Andriy Gapon wrote: SCHEDULER_STOPPED

The USB code needs to check for SCHEDULER_STOPPED and cold at the present moment. If this state can be set during bootup and cleared at the same time as cold, that would be very good.

--HPS
Re: USB/coredump hangs in 8 and 9
on 20/08/2011 19:54 Hans Petter Selasky said the following: [previous message snipped]

Sorry again - not sure if I follow. SCHEDULER_STOPPED is supposed to be set on panic and never be reset. It's like a mirror of 'cold', in a sense.

-- Andriy Gapon
Re: USB/coredump hangs in 8 and 9
On Saturday 20 August 2011 19:09:02 Andriy Gapon wrote: on 20/08/2011 19:54 Hans Petter Selasky said the following: [earlier exchange snipped] SCHEDULER_STOPPED is supposed to be set on panic and never be reset. It's like a mirror of 'cold' in a sense.

OK. Then you should add a test for !SCHEDULER_STOPPED where I pointed out:

static void
usbd_callback_wrapper(struct usb_xfer_queue *pq)
{
	struct usb_xfer *xfer = pq->curr;
	struct usb_xfer_root *info = xfer->xroot;

	USB_BUS_LOCK_ASSERT(info->bus, MA_OWNED);
	if (!mtx_owned(info->xfer_mtx) && !SCHEDULER_STOPPED) {
		/*
		 * Cases that end up here:
		 *

And also ensure that no mutex asserts can trigger further panics.

--HPS
Re: bad sector in gmirror HDD
On Aug 19, 2011, at 11:24 PM, Jeremy Chadwick wrote: On Fri, Aug 19, 2011 at 09:39:17PM -0400, Dan Langille wrote: On Aug 19, 2011, at 7:21 PM, Jeremy Chadwick wrote: On Fri, Aug 19, 2011 at 04:50:01PM -0400, Dan Langille wrote:

System in question: FreeBSD 8.2-STABLE #3: Thu Mar 3 04:52:04 GMT 2011. After a recent power failure, I'm seeing this in my logs:

Aug 19 20:36:34 bast smartd[1575]: Device: /dev/ad2, 2 Currently unreadable (pending) sectors

I doubt this is related to a power failure.

Searching on that error message, I was led to believe that identifying the bad sector and running dd to read it would cause the HDD to reallocate that bad block. http://smartmontools.sourceforge.net/badblockhowto.html

This is incorrect (meaning you've misunderstood what's written there). Unreadable LBAs can be a result of the LBA being actually bad (as in uncorrectable), or the LBA being marked suspect. In either case the LBA will return an I/O error when read. If the LBAs are marked suspect, the drive will perform re-analysis of the LBA (to determine if the LBA can be read and the data re-mapped, or, if it cannot, then the LBA is marked uncorrectable) when you **write** to the LBA. The above smartd output doesn't tell me much. Providing actual SMART attribute data (smartctl -a) for the drive would help. The brand of the drive, the firmware version, and the model all matter -- every drive behaves a little differently.

Information such as this? http://beta.freebsddiary.org/smart-fixing-bad-sector.php

Yes, perfect. Thank you. First things first: upgrade smartmontools to 5.41. Your attributes will be the same after you do this (the drive is already in smartmontools' internal drive DB), but I often have to remind people that they really need to keep smartmontools updated as often as possible. The changes between versions are vast; this is especially important for people with SSDs (I'm responsible for submitting some recent improvements for Intel 320 and 510 SSDs).

Done.
Anyway, the drive (albeit an old PATA Maxtor) appears to have three anomalies:

1) One confirmed reallocated LBA (SMART attribute 5)
2) One suspect LBA (SMART attribute 197)
3) A very high temperature of 51C (SMART attribute 194). If this drive is in an enclosure or in a system with no fans this would be understandable, otherwise this is a bit high. My home workstation, which has only one case fan, has a drive with more platters than your Maxtor, and it idles at ~38C. Possibly this drive has been undergoing constant I/O recently (which does greatly increase drive temperature)? Not sure. I'm not going to focus too much on this one.

This is an older system. I suspect insufficient ventilation. I'll look at getting a new case fan, if not some HDD fans.

The SMART error log also indicates an LBA failure at the 26000 hour mark (which is 16 hours prior to when you did smartctl -a /dev/ad2). Whether that LBA is the remapped one or the suspect one is unknown. The LBA was 5566440. The SMART tests you did didn't really amount to anything; no surprise. Short and long tests usually do not test the surface of the disk. There are some drives which do it on a long test but, as I said before, everything varies from drive to drive. Furthermore, on this model of drive, you cannot do a surface scan via SMART.

Bummer.

That's indicated in the Offline data collection capabilities section at the top, where it reads: No Selective Self-test supported. So you'll have to use the dd method. This takes longer than if surface scanning were supported by the drive, but is acceptable. I'll get to how to go about that in a moment.

FWIW, I've done a dd read of the entire suspect disk already. Just two errors.
From the URL mentioned above:

[root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m conv=noerror
dd: /dev/ad2: Input/output error
2717+0 records in
2717+0 records out
2848980992 bytes transferred in 127.128503 secs (22410246 bytes/sec)
dd: /dev/ad2: Input/output error
38170+1 records in
38170+1 records out
40025063424 bytes transferred in 1544.671423 secs (25911701 bytes/sec)
[root@bast:~] #

That seems to indicate two problems. Are those the values I should be using with dd?

I did some more precise testing:

# time dd of=/dev/null if=/dev/ad2 bs=512 iseek=5566440
dd: /dev/ad2: Input/output error
9+0 records in
9+0 records out
4608 bytes transferred in 5.368668 secs (858 bytes/sec)
real 0m5.429s
user 0m0.000s
sys 0m0.010s

NOTE: that's 9 blocks later than mentioned in smartctl. The above generated this in /var/log/messages:

Aug 20 17:29:25 bast kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=5566449

[stuff snipped]

That said: http://jdc.parodius.com/freebsd/bad_block_scan

If you run this on your ad2 drive, I'm hoping what you'll find are two LBAs which can't be read -- one will be the remapped LBA
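The numbers in the message are internally consistent, and the arithmetic is worth spelling out: LBA 5566449 at 512 bytes per sector lands inside the 1 MiB block where dd reported its first I/O error (record 2717). A quick check:

```shell
LBA=5566449   # failing LBA from the READ_DMA error above
BS=512        # bytes per sector
offset=$((LBA * BS))
blk=$((offset / 1048576))
echo "byte offset $offset, inside 1 MiB block $blk"   # block 2717 matches dd's first error
# To force reallocation of that one sector (destroys its 512 bytes of data),
# the write discussed elsewhere in the thread would be:
#   dd if=/dev/zero of=/dev/ad2 bs=512 oseek=5566449 count=1
```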
Re: 32GB limit per swap device?
On Sat, Aug 20, 2011 at 12:33:29PM -0500, Alan Cox wrote: On Thu, Aug 18, 2011 at 3:16 AM, Alexander V. Chernikov melif...@ipfw.ru wrote: On 10.08.2011 19:16, per...@pluto.rain.com wrote: Chuck Swiger cswi...@mac.com wrote: On Aug 9, 2011, at 7:26 AM, Daniel Kalchev wrote:

I am trying to set up 64GB partitions for swap for a system that has 64GB of RAM (with the idea to dump kernel core etc). But, on 8-stable as of today I get:

WARNING: reducing size to maximum of 67108864 blocks per swap unit

Is there a workaround for this limitation?

Another interesting question: the swap pager operates in page blocks (PAGE_SIZE=4k on common archs). The block device size is passed to swaponsomething() in number of _disk_ blocks (e.g. in DEV_BSIZE=512). After that, the kernel b-list (on top of which the swap pager is built) maximum objects check is enforced. The (possible) problem is that the real object count we will operate on is not the value passed to swaponsomething(), since it is calculated in the wrong units. We should check the b-list limit on the (X * DEV_BSIZE / PAGE_SIZE) value, which is roughly (X / 8), so we should be able to address 32*8=256G. The code should look like this:

Index: vm/swap_pager.c
===================================================================
--- vm/swap_pager.c (revision 223877)
+++ vm/swap_pager.c (working copy)
@@ -2129,6 +2129,15 @@ swaponsomething(struct vnode *vp, void *id, u_long
 	u_long mblocks;
 
 	/*
+	 * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
+	 * First chop nblks off to page-align it, then convert.
+	 *
+	 * sw->sw_nblks is in page-sized chunks now too.
+	 */
+	nblks &= ~(ctodb(1) - 1);
+	nblks = dbtoc(nblks);
+
+	/*
 	 * If we go beyond this, we get overflows in the radix
 	 * tree bitmap code.
 	 */
@@ -2138,14 +2147,6 @@ swaponsomething(struct vnode *vp, void *id, u_long
 		    mblocks);
 		nblks = mblocks;
 	}
-	/*
-	 * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
-	 * First chop nblks off to page-align it, then convert.
-	 *
-	 * sw->sw_nblks is in page-sized chunks now too.
-	 */
-	nblks &= ~(ctodb(1) - 1);
-	nblks = dbtoc(nblks);
 
 	sp = malloc(sizeof *sp, M_VMPGDATA, M_WAITOK | M_ZERO);
 	sp->sw_vp = vp;

(move pages recalculation before b-list check) Can someone comment on this?

I believe that you are correct. Have you tried testing this change on a large swap device?

I probably agree too, but I am in the process of re-reading the swap code, and I do not quite believe in the limit. When the initial code was committed, our daddr_t was 32bit; I checked the RELENG_4 sources. Current code uses int64_t for daddr_t. My impression right now is that we only utilize the low 32 bits of daddr_t. Especially interesting is the following typedef:

typedef uint32_t u_daddr_t; /* unsigned disk address */

which (correctly) means that the typical mask (u_daddr_t)-1 is 0xffffffff. I wonder whether we could just use the full 64 bits and de-facto remove the limitation on the swap partition size.
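The unit confusion described above is easy to see numerically: 67108864 (2^26) is the b-list object limit from the kernel warning, and the cap it implies depends entirely on what one "block" means. Interpreted as 512-byte disk blocks it gives the observed 32GB ceiling; interpreted as 4k pages, as the patch intends, it gives the 256G the message predicts:

```shell
BLKS=67108864   # b-list object limit from the kernel warning (2^26)
gib=$((1024 * 1024 * 1024))
cap_disk=$((BLKS * 512 / gib))    # limit if objects are DEV_BSIZE blocks
cap_page=$((BLKS * 4096 / gib))   # limit if objects are PAGE_SIZE pages
echo "as 512-byte disk blocks: $cap_disk GiB"   # the 32GB cap users hit
echo "as 4k pages:             $cap_page GiB"   # the 256G cap after the fix
```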
Re: bad sector in gmirror HDD
You can run a long self-test in smartmontools (-t long). Then you can get the failed sector number from smartmontools (-l selftest), and then you can use dd to write zero to that specific sector. Also, I highly recommend setting up smartd as a daemon and monitoring the number of reallocated sectors. If they grow again, then it is a good time to retire this disk.

[root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m conv=noerror
dd: /dev/ad2: Input/output error
2717+0 records in
2717+0 records out
2848980992 bytes transferred in 127.128503 secs (22410246 bytes/sec)
dd: /dev/ad2: Input/output error
38170+1 records in
38170+1 records out
40025063424 bytes transferred in 1544.671423 secs (25911701 bytes/sec)
[root@bast:~] #

That seems to indicate two problems. Are those the values I should be using with dd?
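The thread stops short of a concrete smartd configuration. A sketch of what the suggested monitoring could look like, using standard smartd.conf directives (the weekly schedule and temperature thresholds here are illustrative choices, not taken from the thread):

```
# /etc/smartd.conf -- hypothetical entry for the drive in this thread:
#   -a               monitor all SMART attributes and log changes
#   -o on / -S on    enable automatic offline testing and attribute autosave
#   -s L/../../7/03  schedule a long self-test weekly (day 7, 03:00)
#   -W 4,45,50       track temperature: report 4C changes, log at 45C, warn at 50C
#   -m root          mail warnings to root
/dev/ad2 -a -o on -S on -s L/../../7/03 -W 4,45,50 -m root
```

With such an entry, growth in attributes 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector) generates mail instead of going unnoticed.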
Re: 32GB limit per swap device?
On Thu, Aug 18, 2011 at 3:16 AM, Alexander V. Chernikov melif...@ipfw.ru wrote: [question about the swap b-list limit being checked in disk blocks rather than pages, and the proposed vm/swap_pager.c patch, snipped -- quoted in full in the previous message]

(move pages recalculation before b-list check) Can someone comment on this?

I believe that you are correct. Have you tried testing this change on a large swap device?

Alan
Re: bad sector in gmirror HDD
On Sat, Aug 20, 2011 at 01:34:41PM -0400, Dan Langille wrote: On Aug 19, 2011, at 11:24 PM, Jeremy Chadwick wrote: On Fri, Aug 19, 2011 at 09:39:17PM -0400, Dan Langille wrote:

... Information such as this? http://beta.freebsddiary.org/smart-fixing-bad-sector.php ...

3) A very high temperature of 51C (SMART attribute 194). If this drive is in an enclosure or in a system with no fans this would be ...

eh? What's the temperature of the second drive?

... This is an older system. I suspect insufficient ventilation. I'll look at getting a new case fan, if not some HDD fans. ...

I still suggest you replace the drive, although given its age I doubt you'll be able to find a suitable replacement.

Older drive and errors starting to happen, replace ASAP.

I tend to keep disks like this around for testing/experimental purposes and not for actual use.

I have several unused 80GB HDDs I can place into this system. I think that's what I'll wind up doing. But I'd like to follow this process through and get it documented for future reference.

If the data is valuable, the sooner the better. It's actually somewhat saner if the two drives are not from the same lot.

-- Dan Langille - http://langille.org

- Diane
-- - d...@freebsd.org d...@db.net http://www.db.net/~db Why leave money to our children if we don't leave them the Earth?
Re: bad sector in gmirror HDD
On Aug 20, 2011, at 1:54 PM, Alex Samorukov wrote: [root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m conv=noerror dd: /dev/ad2: Input/output error 2717+0 records in 2717+0 records out 2848980992 bytes transferred in 127.128503 secs (22410246 bytes/sec) dd: /dev/ad2: Input/output error 38170+1 records in 38170+1 records out 40025063424 bytes transferred in 1544.671423 secs (25911701 bytes/sec) [root@bast:~] # That seems to indicate two problems. Are those the values I should be using with dd? You can run long self-test in smartmontools (-t long). Then you can get failed sector number from the smartmontools (-l selftest) and then you can use DD to write zero to the specific sector. Already done: http://beta.freebsddiary.org/smart-fixing-bad-sector.php Search for 786767 Or did you mean something else? That doesn't seem to map to a particular sector though... I ran it for a while... # time dd of=/dev/null if=/dev/ad2 bs=512 iseek=786767 ^C4301949+0 records in 4301949+0 records out 2202597888 bytes transferred in 780.245828 secs (2822954 bytes/sec) real 13m0.256s user 0m22.087s sys 3m24.215s Also i am highly recommending to setup smartd as daemon and to monitor number of relocated sectors. If they will grow again - then it is a good time to utilize this disk. It is running, but with nothing custom in the .conf file. -- Dan Langille - http://langille.org
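For reference, the "write zeros to one sector" step Alex describes can be sketched as below. This is a demonstration against a scratch file, not the real /dev/ad2 — the DISK path and LBA value are placeholders, and on a real drive the LBA would come from the SMART self-test log (or a dd read scan) and should be double-checked before anything is written.

```shell
# Demo on a scratch file instead of the real /dev/ad2; DISK and LBA are
# placeholders. On a real drive you only write once you are sure of the LBA.
DISK=/tmp/fakedisk
LBA=100                 # stands in for a real sector number like 5566449

# Fabricate a 200-sector "disk" full of non-zero data.
dd if=/dev/urandom of="$DISK" bs=512 count=200 2>/dev/null

# Read just that one sector (on a bad block, this is the read that fails).
dd if="$DISK" of=/dev/null bs=512 skip="$LBA" count=1 2>/dev/null

# Overwrite exactly one 512-byte sector with zeros; conv=notrunc keeps
# the rest of the "disk" intact.
dd if=/dev/zero of="$DISK" bs=512 seek="$LBA" count=1 conv=notrunc 2>/dev/null

# Verify: the target sector now contains only zero bytes.
NONZERO=$(dd if="$DISK" bs=512 skip="$LBA" count=1 2>/dev/null \
          | tr -d '\0' | wc -c | tr -d ' ')
echo "non-zero bytes in sector $LBA: $NONZERO"   # 0
```

Note that FreeBSD's dd also accepts iseek= (as used in the thread) as a synonym for skip= on the input side; skip= is the portable spelling.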
Re: bad sector in gmirror HDD
On Aug 20, 2011, at 2:04 PM, Diane Bruce wrote: On Sat, Aug 20, 2011 at 01:34:41PM -0400, Dan Langille wrote: On Aug 19, 2011, at 11:24 PM, Jeremy Chadwick wrote: On Fri, Aug 19, 2011 at 09:39:17PM -0400, Dan Langille wrote: ... Information such as this? http://beta.freebsddiary.org/smart-fixing-bad-sector.php ... 3) A very high temperature of 51C (SMART attribute 194). If this drive is in an enclosure or in a system with no fans this would be ... eh? What's the temperature of the second drive? Roughly the same: [root@bast:/home/dan/tmp] # smartctl -a /dev/ad2 | grep -i temp 194 Temperature_Celsius 0x0022 080 076 042 Old_age Always - 51 [root@bast:/home/dan/tmp] # smartctl -a /dev/ad0 | grep -i temp 194 Temperature_Celsius 0x0022 081 074 042 Old_age Always - 49 [root@bast:/home/dan/tmp] # FYI, when I first set up smartd, I questioned those values. The HDD in question, at the time, did not feel hot to the touch. ... This is an older system. I suspect insufficient ventilation. I'll look at getting a new case fan, if not some HDD fans. ... I still suggest you replace the drive, although given its age I doubt Older drive and errors starting to happen, replace ASAP. you'll be able to find a suitable replacement. I tend to keep disks like this around for testing/experimental purposes and not for actual use. I have several unused 80GB HDD I can place into this system. I think that's what I'll wind up doing. But I'd like to follow this process through and get it documented for future reference. If the data is valuable, the sooner the better. It's actually somewhat saner if the two drives are not from the same lot. Noted. -- Dan Langille - http://langille.org
Re: debugging frequent kernel panics on 8.2-RELEASE
Repeat this enough times and prison0.pr_uref reaches zero. To reach zero even sooner just kill enough of non-jailed processes. Interesting. We've been getting kernel panics in -stable but with only one jail started at boot without being restarted. Are you using SAS drives by any chance? Setting ethernet polling and HZ? How about softupdates, gmirror, and/or anything in sysctl.conf? Roger Marquis
Re: debugging frequent kernel panics on 8.2-RELEASE
- Original Message - From: Roger Marquis marq...@roble.com To: freebsd-j...@freebsd.org; freebsd-stable@FreeBSD.org Sent: Saturday, August 20, 2011 7:10 PM Subject: Re: debugging frequent kernel panics on 8.2-RELEASE Repeat this enough times and prison0.pr_uref reaches zero. To reach zero even sooner just kill enough of non-jailed processes. Interesting. We've been getting kernel panics in -stable but with only one jail started at boot without being restarted. Are you using SAS drives by any chance? Setting ethernet polling and HZ? How about softupdates, gmirror, and/or anything in sysctl.conf? If you're not restarting things it may be unrelated. No SAS, polling is compiled in but no devices have it active, and using ZFS only. Are you seeing a double fault panic? Regards Steve This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmas...@multiplay.co.uk.
Re: bad sector in gmirror HDD
On Sat, Aug 20, 2011 at 07:54:30PM +0200, Alex Samorukov wrote: You can run long self-test in smartmontools (-t long). Then you can get failed sector number from the smartmontools (-l selftest) and then you can use DD to write zero to the specific sector. This is inaccurate advice. I covered this in my reply already as well: http://lists.freebsd.org/pipermail/freebsd-stable/2011-August/063665.html Quote: The SMART tests you did didn't really amount to anything; no surprise. short and long tests usually do not test the surface of the disk. There are some drives which do it on a long test, but as I said before, everything varies from drive to drive. TL;DR version: smartctl -t long != smartctl -t select. The OP's drive does not support selective scans (-t select), and long turned up nothing (no surprise there either). So, using dd to find the bad LBAs is the only choice he has. Also i am highly recommending to setup smartd as daemon and to monitor number of relocated sectors. If they will grow again - then it is a good time to utilize this disk. You have to know what you're looking at and how to interpret the data smartd gives you for it to be useful. -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB |
Re: bad sector in gmirror HDD
Dan, I will respond to your reply sometime tomorrow. I do not have time to review the Email today (~7.7KBytes), but will have time tomorrow. -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB |
Re: 32GB limit per swap device?
On 08/20/2011 12:41, Kostik Belousov wrote: On Sat, Aug 20, 2011 at 12:33:29PM -0500, Alan Cox wrote: On Thu, Aug 18, 2011 at 3:16 AM, Alexander V. Chernikov melif...@ipfw.ru wrote: On 10.08.2011 19:16, per...@pluto.rain.com wrote: Chuck Swiger cswi...@mac.com wrote: On Aug 9, 2011, at 7:26 AM, Daniel Kalchev wrote: I am trying to set up 64GB partitions for swap for a system that has 64GB of RAM (with the idea to dump kernel core etc). But, on 8-stable as of today I get: WARNING: reducing size to maximum of 67108864 blocks per swap unit Is there a workaround for this limitation? Another interesting question: the swap pager operates in page blocks (PAGE_SIZE=4k on common arch). Block device size is passed to swaponsomething() in number of _disk_ blocks (e.g. in DEV_BSIZE=512). After that, the kernel b-list (on top of which the swap pager is built) maximum objects check is enforced. The (possible) problem is that the real object count we will operate on is not the value passed to swaponsomething(), since it is calculated in the wrong units. We should check the b-list limit on the (X * DEV_BSIZE (512) / PAGE_SIZE) value, which is roughly (X / 8), so we should be able to address 32*8=256G. The code should look like this:

Index: vm/swap_pager.c
===================================================================
--- vm/swap_pager.c (revision 223877)
+++ vm/swap_pager.c (working copy)
@@ -2129,6 +2129,15 @@ swaponsomething(struct vnode *vp, void *id, u_long
 	u_long mblocks;

 	/*
+	 * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
+	 * First chop nblks off to page-align it, then convert.
+	 *
+	 * sw->sw_nblks is in page-sized chunks now too.
+	 */
+	nblks &= ~(ctodb(1) - 1);
+	nblks = dbtoc(nblks);
+
+	/*
 	 * If we go beyond this, we get overflows in the radix
 	 * tree bitmap code.
 	 */
@@ -2138,14 +2147,6 @@ swaponsomething(struct vnode *vp, void *id, u_long
 		    mblocks);
 		nblks = mblocks;
 	}
-	/*
-	 * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
-	 * First chop nblks off to page-align it, then convert.
-	 *
-	 * sw->sw_nblks is in page-sized chunks now too.
-	 */
-	nblks &= ~(ctodb(1) - 1);
-	nblks = dbtoc(nblks);

 	sp = malloc(sizeof *sp, M_VMPGDATA, M_WAITOK | M_ZERO);
 	sp->sw_vp = vp;

(move pages recalculation before b-list check) Can someone comment on this? I believe that you are correct. Have you tried testing this change on a large swap device? I probably agree too, but I am in the process of re-reading the swap code, and I do not quite believe in the limit. I'm uncertain whether the current limit, 0x40000000 / BLIST_META_RADIX, is exact or not, but I doubt that it is too large. When the initial code was committed, our daddr_t was 32bit, I checked the RELENG_4 sources. Current code uses int64_t for daddr_t. My impression right now is that we only utilize the low 32 bits of daddr_t. Esp. interesting looks the following typedef: typedef uint32_t u_daddr_t; /* unsigned disk address */ which (correctly) means that the typical mask (u_daddr_t)-1 is 0xffffffff. I wonder whether we could just use the full 64 bits and de-facto remove the limitation on the swap partition size. I would rather argue first that the subr_blist code should not be using daddr_t at all. The code is abusing daddr_t and defining u_daddr_t to represent things that are not disk addresses. Instead, it should either define its own type or directly use (u)int*_t. Then, as for choosing between 32 and 64 bits, I'm skeptical of using this structure for managing more than 32 bits worth of blocks, given the amount of RAM it will use.
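As a sanity check on the units argument above, the arithmetic can be run directly. This sketch assumes DEV_BSIZE=512, PAGE_SIZE=4096, BLIST_META_RADIX=16, and the overflow guard mblocks = 0x40000000 / BLIST_META_RADIX from swaponsomething():

```shell
# Assumed constants: DEV_BSIZE=512, PAGE_SIZE=4096, BLIST_META_RADIX=16.
mblocks=$(( 0x40000000 / 16 ))
echo "$mblocks"                                        # 67108864, as in the warning

# Limit applied while nblks is still counted in 512-byte disk blocks:
echo "$(( mblocks * 512 / 1024 / 1024 / 1024 )) GB"    # 32 GB

# Limit applied after nblks is converted to page-sized chunks (the
# proposed reordering): the same block count covers 8x as much space.
echo "$(( mblocks * 4096 / 1024 / 1024 / 1024 )) GB"   # 256 GB
```

This matches the 32*8=256G figure from the message: doing the DEV_BSIZE-to-PAGE_SIZE conversion before the b-list limit check means the cap applies to pages, not 512-byte blocks.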
Re: bad sector in gmirror HDD
On Aug 20, 2011, at 2:36 PM, Jeremy Chadwick wrote: Dan, I will respond to your reply sometime tomorrow. I do not have time to review the Email today (~7.7KBytes), but will have time tomorrow. No worries. Thank you. -- Dan Langille - http://langille.org
Re: bad sector in gmirror HDD
The SMART tests you did didn't really amount to anything; no surprise. short and long tests usually do not test the surface of the disk. There are some drives which do it on a long test, but as I said before, everything varies from drive to drive. That is not a correct statement, sorry. The long test tries to read all the data from the surface (and does some other things). // one of the smartmontools developers and sysutils/smartmontools maintainer.
Re: 32GB limit per swap device?
Alan Cox wrote: On 08/20/2011 12:41, Kostik Belousov wrote: On Sat, Aug 20, 2011 at 12:33:29PM -0500, Alan Cox wrote: On Thu, Aug 18, 2011 at 3:16 AM, Alexander V. Chernikov melif...@ipfw.ru wrote: On 10.08.2011 19:16, per...@pluto.rain.com wrote: Chuck Swiger cswi...@mac.com wrote: On Aug 9, 2011, at 7:26 AM, Daniel Kalchev wrote: I am trying to set up 64GB partitions for swap for a system that has 64GB of RAM (with the idea to dump kernel core etc). But, on 8-stable as of today I get: WARNING: reducing size to maximum of 67108864 blocks per swap unit Is there a workaround for this limitation? Another interesting question: the swap pager operates in page blocks (PAGE_SIZE=4k on common arch). Block device size is passed to swaponsomething() in number of _disk_ blocks (e.g. in DEV_BSIZE=512). After that, the kernel b-list (on top of which the swap pager is built) maximum objects check is enforced. The (possible) problem is that the real object count we will operate on is not the value passed to swaponsomething(), since it is calculated in the wrong units. We should check the b-list limit on the (X * DEV_BSIZE (512) / PAGE_SIZE) value, which is roughly (X / 8), so we should be able to address 32*8=256G. The code should look like this:

Index: vm/swap_pager.c
===================================================================
--- vm/swap_pager.c (revision 223877)
+++ vm/swap_pager.c (working copy)
@@ -2129,6 +2129,15 @@ swaponsomething(struct vnode *vp, void *id, u_long
 	u_long mblocks;

 	/*
+	 * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
+	 * First chop nblks off to page-align it, then convert.
+	 *
+	 * sw->sw_nblks is in page-sized chunks now too.
+	 */
+	nblks &= ~(ctodb(1) - 1);
+	nblks = dbtoc(nblks);
+
+	/*
 	 * If we go beyond this, we get overflows in the radix
 	 * tree bitmap code.
 	 */
@@ -2138,14 +2147,6 @@ swaponsomething(struct vnode *vp, void *id, u_long
 		    mblocks);
 		nblks = mblocks;
 	}
-	/*
-	 * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
-	 * First chop nblks off to page-align it, then convert.
-	 *
-	 * sw->sw_nblks is in page-sized chunks now too.
-	 */
-	nblks &= ~(ctodb(1) - 1);
-	nblks = dbtoc(nblks);

 	sp = malloc(sizeof *sp, M_VMPGDATA, M_WAITOK | M_ZERO);
 	sp->sw_vp = vp;

(move pages recalculation before b-list check) Can someone comment on this? I believe that you are correct. Have you tried testing this change on a large swap device? I will try tomorrow. I probably agree too, but I am in the process of re-reading the swap code, and I do not quite believe in the limit. I'm uncertain whether the current limit, 0x40000000 / BLIST_META_RADIX, is exact or not, but I doubt that it is too large. It is not exact. It is a rough estimation of sizeof(blmeta_t) * X < 4G (blist_create() assumes malloc() is not able to allocate more than 4G; I'm not sure if that is true these days). X is the number of blocks we need to store. The actual number, however, is X / (1 + 1/BLIST_META_RADIX + 1/BLIST_META_RADIX^2 + ...) but it differs from X not very much. A blist can be seen as a tree of radix trees, with meta-information for all those radix trees allocated by a single allocation, which imposes this limit. The meta-information is used to find free blocks more quickly. A single linear allocation is required to advance to the next radix tree on the same level very fast:

 *  *  *  *  *
** ** ** ** **
^^^

Some kind of schema with 3 levels in the tree and BLIST_META_RADIX=2 (instead of 16). When the initial code was committed, our daddr_t was 32bit, I checked the RELENG_4 sources. Current code uses int64_t for daddr_t. My impression right now is that we only utilize the low 32 bits of daddr_t. Esp. interesting looks the following typedef: typedef uint32_t u_daddr_t; /* unsigned disk address */ which (correctly) means that the typical mask (u_daddr_t)-1 is 0xffffffff. I wonder whether we could just use the full 64 bits and de-facto remove the limitation on the swap partition size. This will increase struct blmeta_t twice and cause 2*X memory usage for every swap configuration. I would rather argue first that the subr_blist code should not be using daddr_t at all. The code is abusing daddr_t and defining u_daddr_t to represent things that are not disk addresses. Instead, it should either define its own type or directly use (u)int*_t. Then, as for choosing between 32 and 64 bits, I'm skeptical of using this structure for managing more than 32 bits worth of blocks, given the amount of RAM it will use.
Re: 32GB limit per swap device?
On Sat, Aug 20, 2011 at 10:42:28PM +0400, Alexander V. Chernikov wrote: Alan Cox wrote: On 08/20/2011 12:41, Kostik Belousov wrote: On Sat, Aug 20, 2011 at 12:33:29PM -0500, Alan Cox wrote: On Thu, Aug 18, 2011 at 3:16 AM, Alexander V. Chernikov melif...@ipfw.ru wrote: On 10.08.2011 19:16, per...@pluto.rain.com wrote: Chuck Swiger cswi...@mac.com wrote: On Aug 9, 2011, at 7:26 AM, Daniel Kalchev wrote: I am trying to set up 64GB partitions for swap for a system that has 64GB of RAM (with the idea to dump kernel core etc). But, on 8-stable as of today I get: WARNING: reducing size to maximum of 67108864 blocks per swap unit Is there a workaround for this limitation? Another interesting question: the swap pager operates in page blocks (PAGE_SIZE=4k on common arch). Block device size is passed to swaponsomething() in number of _disk_ blocks (e.g. in DEV_BSIZE=512). After that, the kernel b-list (on top of which the swap pager is built) maximum objects check is enforced. The (possible) problem is that the real object count we will operate on is not the value passed to swaponsomething(), since it is calculated in the wrong units. We should check the b-list limit on the (X * DEV_BSIZE (512) / PAGE_SIZE) value, which is roughly (X / 8), so we should be able to address 32*8=256G. The code should look like this:

Index: vm/swap_pager.c
===================================================================
--- vm/swap_pager.c (revision 223877)
+++ vm/swap_pager.c (working copy)
@@ -2129,6 +2129,15 @@ swaponsomething(struct vnode *vp, void *id, u_long
 	u_long mblocks;

 	/*
+	 * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
+	 * First chop nblks off to page-align it, then convert.
+	 *
+	 * sw->sw_nblks is in page-sized chunks now too.
+	 */
+	nblks &= ~(ctodb(1) - 1);
+	nblks = dbtoc(nblks);
+
+	/*
 	 * If we go beyond this, we get overflows in the radix
 	 * tree bitmap code.
 	 */
@@ -2138,14 +2147,6 @@ swaponsomething(struct vnode *vp, void *id, u_long
 		    mblocks);
 		nblks = mblocks;
 	}
-	/*
-	 * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
-	 * First chop nblks off to page-align it, then convert.
-	 *
-	 * sw->sw_nblks is in page-sized chunks now too.
-	 */
-	nblks &= ~(ctodb(1) - 1);
-	nblks = dbtoc(nblks);

 	sp = malloc(sizeof *sp, M_VMPGDATA, M_WAITOK | M_ZERO);
 	sp->sw_vp = vp;

(move pages recalculation before b-list check) Can someone comment on this? I believe that you are correct. Have you tried testing this change on a large swap device? I will try tomorrow. I probably agree too, but I am in the process of re-reading the swap code, and I do not quite believe in the limit. I'm uncertain whether the current limit, 0x40000000 / BLIST_META_RADIX, is exact or not, but I doubt that it is too large. It is not exact. It is a rough estimation of sizeof(blmeta_t) * X < 4G (blist_create() assumes malloc() is not able to allocate more than 4G; I'm not sure if that is true these days). X is the number of blocks we need to store. The actual number, however, is X / (1 + 1/BLIST_META_RADIX + 1/BLIST_META_RADIX^2 + ...) but it differs from X not very much. A blist can be seen as a tree of radix trees, with meta-information for all those radix trees allocated by a single allocation, which imposes this limit. The meta-information is used to find free blocks more quickly. A single linear allocation is required to advance to the next radix tree on the same level very fast:

 *  *  *  *  *
** ** ** ** **
^^^

Some kind of schema with 3 levels in the tree and BLIST_META_RADIX=2 (instead of 16). When the initial code was committed, our daddr_t was 32bit, I checked the RELENG_4 sources. Current code uses int64_t for daddr_t. My impression right now is that we only utilize the low 32 bits of daddr_t. Esp. interesting looks the following typedef: typedef uint32_t u_daddr_t; /* unsigned disk address */ which (correctly) means that the typical mask (u_daddr_t)-1 is 0xffffffff. I wonder whether we could just use the full 64 bits and de-facto remove the limitation on the swap partition size. This will increase struct blmeta_t twice and cause 2*X memory usage for every swap configuration. No, daddr_t is already 64bit. Nothing will increase. My point is that the current limitation is artificial. I think Alan's note referred to the amount of radix tree nodes required to cover a large swap partition. But it could be a good temporary measure. I expect to be able to provide some numeric evidence later. I would rather argue first that the subr_blist code
Re: bad sector in gmirror HDD
On Sat, Aug 20, 2011 at 08:43:09PM +0200, Alex Samorukov wrote: The SMART tests you did didn't really amount to anything; no surprise. short and long tests usually do not test the surface of the disk. There are some drives which do it on a long test, but as I said before, everything varies from drive to drive. It is not correct statement, sorry. Long test trying to read all the data from surface (and doing some other things). // one of the smartmontools developers and sysutils/smartmontools maintainer. That's great, but too bad it's generally not true in practice. Dan's long scan on his site proves it, and I've dealt with this situation myself many times over. SMART long tests *may* do a surface scan, but in most cases they just seem to do something that's similar to short but over a longer period of time. Furthermore, some which *do* do a surface scan on a long test don't always report LBA failures in the self-test log. I've personally seen this happen on Western Digital disks (model strings are unknown, I'm certain I've rid myself of those drives). Firmware bug/quirk? Possibly, but at the end of the day it doesn't matter -- it means the end-user has wasted 2-3 hours for something that tests OK yet we know for a fact isn't OK. I *have* seen a drive do a surface scan on a long test and report LBAs it couldn't read, but as I said, it's rare and varies from vendor to vendor, drive to drive, and firmware to firmware. When it happened I was very, very surprised (and delighted). The only thing I can trust 100% of the time when it comes to surface scans is SMART selective scans (if available, which again the OP's drive does not offer), or using dd or a read-per-LBA on the OS level (which works everywhere). -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, US | | Making life hard for others since 1977.
PGP 4BD6C0CB |
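The bad_block_scan script itself is not reproduced in the thread, but the idea — read one LBA at a time and report the ones the read fails on — can be sketched as below. The loop runs against a scratch file here (so every read succeeds); on real hardware DISK would be the suspect device and a bad sector would make dd fail with an I/O error.

```shell
# Scratch-file stand-in for the suspect disk; on real hardware DISK would
# be e.g. /dev/ad2 and a bad sector would make the read fail.
DISK=/tmp/scan_me
dd if=/dev/urandom of="$DISK" bs=512 count=64 2>/dev/null

SECTORS=$(( $(wc -c < "$DISK") / 512 ))
bad=0
lba=0
while [ "$lba" -lt "$SECTORS" ]; do
    # Read exactly one 512-byte sector; a failed read means a bad LBA.
    if ! dd if="$DISK" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null; then
        echo "unreadable LBA: $lba"
        bad=$((bad + 1))
    fi
    lba=$((lba + 1))
done
echo "scanned $SECTORS sectors, $bad unreadable"
```

Note that a per-LBA loop over a whole disk is very slow; a practical script would scan in large chunks first and fall back to single-sector reads only inside a chunk that failed.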
Re: Remote installing
On 20-8-2011 13:26, Willem Jan Withagen wrote: On 2011-08-20 13:15, Willem Jan Withagen wrote: Hi, Today I liked to live dangerously, and want to upgrade a backups server from i386 to amd64. Just to see if we could. And otherwise I'd scrap it and install from usb-stick. So I have my server running amd64 build GENERIC. export /, /var, /usr on the server to be upgraded. But upgrading world does have a snag already early on: empty changed flags expected schg found none not modified: Operation not supported This is probably where some program wants to set the immutable flag on /var/tmp/empy... But it looks like NFS does not grok that. Now I've seen plenty of suggestions to do it this way, but never saw anybody come back with this complaint. So I must be omitting something?? I looked at the work errors. --- cd /mnt/; rm -f /mnt/sys; ln -s usr/src/sys sys cd /mnt/usr/share/man/en.ISO8859-1; ln -sf ../man* . ln: ./man1: Permission denied ln: ./man1aout: Permission denied ln: ./man2: Permission denied ln: ./man3: Permission denied ln: ./man4: Permission denied ln: ./man5: Permission denied ln: ./man6: Permission denied ln: ./man7: Permission denied ln: ./man8: Permission denied ln: ./man9: Permission denied - Which comes from the target distrib-dirs in etc. Why would an ln -sf like that fail? The filesystems are exported with -maproot=0. Well, it turned out that the easiest fix was to run chflags -R noschg / at the client, because certain files are immutable and once you run into those, it is hard to fix after the fact. Next would be to move /lib and /usr/lib out of the way, so that they don't cause conflicts in the near future. Which will cause new programs to start to fail. So better make sure that everything is set before you start upgrading over NFS. But I did manage to get it upgraded from i386 to amd64. --WjW
Re: bad sector in gmirror HDD
Dan, sorry for the previous mail. Seems my schedule today has just unexpected changed; I had social events to deal with but as I found out a few minutes ago those events are cancelled, which means I have time today to look at your mail. On Sat, Aug 20, 2011 at 01:34:41PM -0400, Dan Langille wrote: On Aug 19, 2011, at 11:24 PM, Jeremy Chadwick wrote: The SMART error log also indicates an LBA failure at the 26000 hour mark (which is 16 hours prior to when you did smartctl -a /dev/ad2). Whether that LBA is the remapped one or the suspect one is unknown. The LBA was 5566440. The SMART tests you did didn't really amount to anything; no surprise. short and long tests usually do not test the surface of the disk. There are some drives which do it on a long test, but as I said before, everything varies from drive to drive. Furthermore, on this model of drive, you cannot do a surface scans via SMART. Bummer. That's indicated in the Offline data collection capabilities section at the top, where it reads: No Selective Self-test supported. So you'll have to use the dd method. This takes longer than if surface scanning was supported by the drive, but is acceptable. I'll get to how to go about that in a moment. FWIW, I've done a dd read of the entire suspect disk already. Just two errors. Actually one error -- keep reading. From the URL mentioned above: [root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m conv=noerror dd: /dev/ad2: Input/output error 2717+0 records in 2717+0 records out 2848980992 bytes transferred in 127.128503 secs (22410246 bytes/sec) dd: /dev/ad2: Input/output error 38170+1 records in 38170+1 records out 40025063424 bytes transferred in 1544.671423 secs (25911701 bytes/sec) [root@bast:~] # That seems to indicate two problems. Are those the values I should be using with dd? The values you refer to are byte offsets, not LBAs. Furthermore, you used a block size of 1 megabyte (not sure why people keep doing this). 
LBA size on your drive is 512 bytes; asking for 1 megabyte in dd is going to make the drive try to read() 1MByte, and an I/O error could happen anywhere within that 1MByte range. (1024*1024) / 512 == 2048 LBAs make up 1MByte. Next, remember that the noerror attribute has some quirks associated with it that need to be kept in mind. The man page discusses these. Finally, I believe the last I/O error you see (at byte 40025063424) is normal given what you told dd to do. It was trying to use bs=1m, and your drive has a capacity limit of 40027029504 bytes. I'm left to believe you had a short read (less than 1MByte), so this is normal. 40027029504 / (1024*1024) == 38172.75, which is not a round number, hence the error. I did some more precise testing: # time dd of=/dev/null if=/dev/ad2 bs=512 iseek=5566440 dd: /dev/ad2: Input/output error 9+0 records in 9+0 records out 4608 bytes transferred in 5.368668 secs (858 bytes/sec) real 0m5.429s user 0m0.000s sys 0m0.010s NOTE: that's 9 blocks later than mentioned in smarctl The above generated this in /var/log/messages: Aug 20 17:29:25 bast kernel: ad2: FAILURE - READ_DMA status=51READY,DSC,ERROR error=40UNCORRECTABLE LBA=5566449 Your dd command above is saying use a block size of 512 bytes, and read indefinitely from /dev/ad2, starting with an lseek() on /dev/ad2 of 5566440. You then get an I/O error somewhere from where you start to when the device ends. You're assuming that the number of bytes transferred indicates where the actual error happened, which in my experience is not always true. What really needs to happen here is use of count=1, and you adjusting iseek manually per each LBA. Or you could use the script I wrote and let the computer do it for you. :-) I understand what you're getting at, re: that's 9 blocks later. But the OS does some caching of I/O and so on sometimes, or aggregates block reads larger than physical LBA size, so that may be what's going on here. 
However, if you keep reading, you might find your answer is that you may (still unsure) have other LBAs which are now marked suspect. That said: http://jdc.parodius.com/freebsd/bad_block_scan If you run this on your ad2 drive, I'm hoping what you'll find are two LBAs which can't be read -- one will be the remapped LBA and one will be the suspect LBA. If you only get one LBA error then that's fine too, and will be the suspect LBA. Once you have the LBA(s), you can submit writes to them to get the drive to re-analyse them (assuming they're suspect): dd if=/dev/zero of=/dev/XXX bs=512 count=1 seek=N Where XXX is the device and N is the LBA number. If this works properly, the dd command should sit there for a little bit (as the drive does its re-analysis magic) and then should complete. ad2 is part of a gmirror with ad0. Does this change things? I haven't tried the dd yet. It does not change things, but I
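Jeremy's byte-offset arithmetic can be checked directly; it also shows that the READ_DMA LBA from Dan's kernel log (5566449) falls inside the 1 MiB window where the first bs=1m read failed:

```shell
# Each bs=1m transfer covers (1024*1024)/512 LBAs:
echo $(( 1048576 / 512 ))             # 2048

# dd completed 2717 full 1 MiB records before the first error, so the bad
# sector lies somewhere in the next 2048-LBA window:
offset=$(( 2717 * 1048576 ))
echo "$offset"                        # 2848980992, as dd reported
start=$(( offset / 512 ))
end=$(( start + 2047 ))
echo "$start $end"                    # 5564416 5566463

# The LBA from the READ_DMA kernel error sits inside that window:
[ 5566449 -ge "$start" ] && [ 5566449 -le "$end" ] && echo "5566449 is in the failed window"

# The capacity (40027029504 bytes) is 38172 MiB plus a remainder, i.e.
# Jeremy's 38172.75 -- so the final bs=1m read is a short read:
echo $(( 40027029504 % 1048576 ))     # 786432 bytes left over
```

In other words, a bs=1m error only localizes the problem to a 2048-sector window; a count=1, bs=512 probe per LBA is still needed to pin down the exact sector.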
Re: debugging frequent kernel panics on 8.2-RELEASE
- Original Message - From: Andriy Gapon a...@freebsd.org thanks for doing this! I'll reiterate my suspicion just in case - I think that you should look for the cases where you stop a jail, but then re-attach and resurrect the jail before it's completely dead. Yes, that's where I think it's happening too, but I also suspect it's not just a dying jail that's needed; I think it's a dying jail in the final stages of cleanup. Looking through the code I believe I may have noticed a scenario which could trigger the problem. Given the following code:

static void
prison_deref(struct prison *pr, int flags)
{
	struct prison *ppr, *tpr;
	int vfslocked;

	if (!(flags & PD_LOCKED))
		mtx_lock(&pr->pr_mtx);
	/* Decrement the user references in a separate loop. */
	if (flags & PD_DEUREF) {
		for (tpr = pr;; tpr = tpr->pr_parent) {
			if (tpr != pr)
				mtx_lock(&tpr->pr_mtx);
			if (--tpr->pr_uref > 0)
				break;
			KASSERT(tpr != &prison0, ("prison0 pr_uref=0"));
			mtx_unlock(&tpr->pr_mtx);
		}
		/* Done if there were only user references to remove. */
		if (!(flags & PD_DEREF)) {
			mtx_unlock(&tpr->pr_mtx);
			if (flags & PD_LIST_SLOCKED)
				sx_sunlock(&allprison_lock);
			else if (flags & PD_LIST_XLOCKED)
				sx_xunlock(&allprison_lock);
			return;
		}
		if (tpr != pr) {
			mtx_unlock(&tpr->pr_mtx);
			mtx_lock(&pr->pr_mtx);
		}
	}

If you take the scenario of a simple one-level prison setup running a single process, where the prison has just been stopped: in the above code the pr_uref of the process's prison is decremented. As this is the last process, pr_uref will hit 0 and the loop continues instead of breaking early. Now at the end of the loop iteration the mtx is unlocked, so other processes can now manipulate the jail, and this is where I think the problem may be. If we now have another process come in and attach to the jail but then instantly exit, this process may allow another kernel thread to hit this same bit of code, and so two processes for the same prison get into the section which decrements prison0's pr_uref, instead of only one.
In essence I think we can get the following flow, where 1# = process1 and 2# = process2:

1#1. prison1.pr_uref = 1 (single process jail)
1#2. prison_deref( prison1, ...
1#3. prison1.pr_uref-- (prison1.pr_uref = 0)
1#4. prison1.mtx_unlock -- this now allows others to change prison1.pr_uref
1#5. prison0.pr_uref--
2#1. process1.attach( prison1 ) (prison1.pr_uref = 1)
2#2. process1.exit
2#3. prison_deref( prison1, ...
2#4. prison1.pr_uref-- (prison1.pr_uref = 0)
2#5. prison1.mtx_unlock -- this now allows others to change prison1.pr_uref
2#6. prison0.pr_uref-- (prison0.pr_uref has now been decremented twice by prison1)

It seems like the action on the parent prison to decrement its pr_uref is happening too early, while the jail can still be used and without holding the lock on the child jail's mtx, causing a race condition. I think the fix is to move the decrement of the parent prison's pr_uref down so it only takes place if the jail is really being removed. Either that, or to change the locking semantics so that once the lock is acquired in prison_deref it's not unlocked until the function completes. What do people think?

Regards
Steve

This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmas...@multiplay.co.uk.

___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: bad sector in gmirror HDD
On Aug 20, 2011, at 3:57 PM, Jeremy Chadwick wrote:

I still suggest you replace the drive, although given its age I doubt you'll be able to find a suitable replacement. I tend to keep disks like this around for testing/experimental purposes and not for actual use.

I have several unused 80GB HDDs I can place into this system. I think that's what I'll wind up doing. But I'd like to follow this process through and get it documented for future reference.

Yes, given the behaviour of the drive I would recommend you simply replace it at this point in time. What concerns me the most is Current_Pending_Sector incrementing, but it's impossible for me to determine if that incrementing means there are other LBAs which are bad, or if the drive is behaving how its firmware is designed. Keep the drive around for further experiments/tinkering if you're interested. Stuff like this is always interesting/fun as long as your data isn't at risk, so doing the replacement first would be best (especially if both drives in your mirror were bought at the same time from the same place and have similar manufacturing plants/dates on them).

I'm happy to send you this drive for your experimentation pleasure. If so, please email me an address offline.

You don't have a disk with errors, and it seems you should have one. After I wipe it. I'm sure I have a destroyer CD here somewhere.

-- 
Dan Langille - http://langille.org
Re: bad sector in gmirror HDD
A follow-up, given that I just viewed the SMART attribute data at the very bottom of this page as of this writing (Sat Aug 20 13:00:09 PDT 2011):

http://beta.freebsddiary.org/smart-fixing-bad-sector.php

And I see this:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 100   100   020    Pre-fail Always  -           2
  9 Power_On_Hours          0x0012 059   059   001    Old_age  Always  -           27440
196 Reallocated_Event_Count 0x0010 099   099   020    Old_age  Offline -           1
197 Current_Pending_Sector  0x0032 100   100   020    Old_age  Always  -           2
198 Offline_Uncorrectable   0x0010 100   253   000    Old_age  Offline -           0

These attributes USUALLY mean:

1) Reallocated_Sector_Ct == There are 2 remapped LBAs.
2) Reallocated_Event_Count == There is 1 remapping event which has been noticed (either failure or success).
3) Current_Pending_Sector == There are 2 LBAs which are suspect.

Now, given my previous statement about this particular model of drive, Maxtor may have a firmware quirk or other oddities that don't cause Current_Pending_Sector to drop to 0 or Reallocated_Event_Count to match reality. I simply don't know. But keep reading.
And remember, this is what we started with:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 100   100   020    Pre-fail Always  -           1
  9 Power_On_Hours          0x0012 059   059   001    Old_age  Always  -           27416
196 Reallocated_Event_Count 0x0010 100   100   020    Old_age  Offline -           0
197 Current_Pending_Sector  0x0032 100   100   020    Old_age  Always  -           1
198 Offline_Uncorrectable   0x0010 100   253   000    Old_age  Offline -           0

Anyway, in the SMART error log, I see 3 entries (2 new ones since the last time I saw the web page):

* Error 3 occurred at disk power-on lifetime: 27422 hours (1142 days + 14 hours)
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440
* Error 2 occurred at disk power-on lifetime: 27421 hours (1142 days + 13 hours)
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440
* Error 1 occurred at disk power-on lifetime: 27400 hours (1141 days + 16 hours)
  40 59 18 e8 ef 54 e0  Error: UNC 24 sectors at LBA = 0x0054efe8 = 5566440

These are all for the same LBA -- 5566440. Error 1 was something we already saw on the page the first time. So where did the other two come from? Earlier on the web page I saw these commands being executed:

sh ./bad_block_scan /dev/ad2 5566400 5566500  -- will hit bad LBA
sh ./bad_block_scan /dev/ad2 5566000 5566500  -- will hit bad LBA
sh ./bad_block_scan /dev/ad2 556 5566000      -- will not hit bad LBA
sh ./bad_block_scan /dev/ad2 556 5566000      -- will not hit bad LBA

So there's the explanation for the two newly-added entries in the SMART error log. I'm very surprised if bad_block_scan did not echo that it had encountered read errors on LBA 5566440. It should have, unless I left the script in some weird state.
The commands to use to verify would be:

dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566439
dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566440
dd if=/dev/ad2 of=/dev/null bs=512 count=1 skip=5566441

(I tend to check around that LBA area as well, just to make sure; that's why there are 3 commands, with -1 and +1 LBAs.) One of these should return an I/O error, unless the LBA has been remapped already, in which case it shouldn't.

Finally, there's this very interesting piece of information in the SMART self-test log (not the selective scan log, but the self-test log; meaning this was the result of smartctl -t long /dev/ad2 at some point):

Num Test_Description  Status                   Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline  Completed: read failure  90%       27416           786767

So it seems this is one of those drives which does do a surface scan on a long test. But that's interesting -- LBA 786767. If that's true, then issuing the same dd commands as above (but with skip changed appropriately) should return an I/O error as well. Naturally, check the SMART error log for verification.

So, it's possible that there are actually two bad LBAs on this drive -- LBA 5566440 and LBA 786767. I simply don't know about the latter, but the former is confirmed in the SMART error log. If either of these LBAs is the one Current_Pending_Sector is referring to, then writes to them should be sufficient to induce re-analysis. E.g.:

dd if=/dev/zero of=/dev/ad2 bs=512 count=1 seek=5566440
dd if=/dev/zero of=/dev/ad2 bs=512 count=1 seek=786767

The offsets for seek (not skip!!!) should probably be based on what the dd reads done earlier would show. Unless of course what we're seeing is
Re: debugging frequent kernel panics on 8.2-RELEASE
- Original Message - From: Steven Hartland kill...@multiplay.co.uk

> Looking through the code I believe I may have noticed a scenario which
> could trigger the problem. [prison_deref code and race analysis from the
> previous message snipped]
>
> It seems like the action on the parent prison to decrement its pr_uref is
> happening too early, while the jail can still be used and without holding
> the lock on the child jail's mtx, causing a race condition. I think the
> fix is to move the decrement of the parent prison's pr_uref down so it
> only takes place if the jail is really being removed. Either that, or to
> change the locking semantics so that once the lock is acquired in
> prison_deref it's not unlocked until the function completes. What do
> people think?

After reviewing the changes to prison_deref in the commit which added hierarchical jails, the removal of the lock by the initial loop on the passed-in prison may be unintentional:

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/kern_jail.c.diff?r1=1.101;r2=1.102;f=h

If so, the following may be all that's needed to fix this issue:

diff -u sys/kern/kern_jail.c.orig sys/kern/kern_jail.c
--- sys/kern/kern_jail.c.orig   2011-08-20 21:17:14.856618854 +0100
+++ sys/kern/kern_jail.c        2011-08-20 21:18:35.307201425 +0100
@@ -2455,7 +2455,8 @@
                        if (--tpr->pr_uref > 0)
                                break;
                        KASSERT(tpr != &prison0, ("prison0 pr_uref=0"));
-                       mtx_unlock(&tpr->pr_mtx);
+                       if (tpr != pr)
+                               mtx_unlock(&tpr->pr_mtx);
                }
                /* Done if there were only user references to remove. */
                if (!(flags & PD_DEREF)) {

Regards
Steve
Re: debugging frequent kernel panics on 8.2-RELEASE
on 20/08/2011 23:24 Steven Hartland said the following:

> [quoted race analysis and prison_deref code snipped]
>
> After reviewing the changes to prison_deref in the commit which added
> hierarchical jails, the removal of the lock by the initial loop on the
> passed-in prison may be unintentional:
>
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/kern_jail.c.diff?r1=1.101;r2=1.102;f=h
>
> If so, the following may be all that's needed to fix this issue:
>
> diff -u sys/kern/kern_jail.c.orig sys/kern/kern_jail.c
> --- sys/kern/kern_jail.c.orig   2011-08-20 21:17:14.856618854 +0100
> +++ sys/kern/kern_jail.c        2011-08-20 21:18:35.307201425 +0100
> @@ -2455,7 +2455,8 @@
>                         if (--tpr->pr_uref > 0)
>                                 break;
>                         KASSERT(tpr != &prison0, ("prison0 pr_uref=0"));
> -                       mtx_unlock(&tpr->pr_mtx);
> +                       if (tpr != pr)
> +                               mtx_unlock(&tpr->pr_mtx);
>                 }
>                 /* Done if there were only user references to remove. */
>                 if (!(flags & PD_DEREF)) {

Not sure if this would fly as is - please double check the later block where pr->pr_mtx is re-locked.

-- 
Andriy Gapon
Re: debugging frequent kernel panics on 8.2-RELEASE
- Original Message - From: Andriy Gapon a...@freebsd.org

> diff -u sys/kern/kern_jail.c.orig sys/kern/kern_jail.c
> --- sys/kern/kern_jail.c.orig   2011-08-20 21:17:14.856618854 +0100
> +++ sys/kern/kern_jail.c        2011-08-20 21:18:35.307201425 +0100
> @@ -2455,7 +2455,8 @@
>                         if (--tpr->pr_uref > 0)
>                                 break;
>                         KASSERT(tpr != &prison0, ("prison0 pr_uref=0"));
> -                       mtx_unlock(&tpr->pr_mtx);
> +                       if (tpr != pr)
> +                               mtx_unlock(&tpr->pr_mtx);
>                 }
>                 /* Done if there were only user references to remove. */
>                 if (!(flags & PD_DEREF)) {
>
> Not sure if this would fly as is - please double check the later block
> where pr->pr_mtx is re-locked.

Will do. I'm now 99.9% sure this is the problem, and even better, I now have a reproducible scenario :)

Something else you may be more interested in, Andriy: I added in the debugging options DDB and INVARIANTS to see if I can get more useful info, and the panic results in a looping panic constantly scrolling up the console. Not sure if this is a side effect of the patches we've been trying. Going to see if I can confirm that; let me know if there's something you want me to try?

Regards
Steve
Re: debugging frequent kernel panics on 8.2-RELEASE
- Original Message - From: Steven Hartland kill...@multiplay.co.uk

> Something else you may be more interested in, Andriy: I added in the
> debugging options DDB and INVARIANTS to see if I can get more useful
> info, and the panic results in a looping panic constantly scrolling up
> the console. Not sure if this is a side effect of the patches we've been
> trying. Going to see if I can confirm that; let me know if there's
> something you want me to try?

Seems the stop_scheduler_on_panic.8.x.patch is the cause of this. Removing it allows me to drop to ddb when the panic due to the KASSERT happens.

Regards
Steve
Re: debugging frequent kernel panics on 8.2-RELEASE
- Original Message - From: Andriy Gapon a...@freebsd.org

> on 20/08/2011 23:24 Steven Hartland said the following:
>> [quoted race analysis, prison_deref code and proposed patch snipped]
>
> Not sure if this would fly as is - please double check the later block
> where pr->pr_mtx is re-locked.

You're right, and it's actually more complex than that. Although changing it not to unlock in the middle of prison_deref fixes that race condition, it doesn't prevent pr_uref being incorrectly decremented each time the jail gets into the dying state, which is really the problem we are seeing.

If hierarchical prisons are used there seems to be an additional problem: the counters of all prisons in the hierarchy are decremented, but as far as I can tell only the immediate parent is ever incremented, so I think there is another reference problem there as well.

The following patch I believe fixes both of these issues. I've tested with debugging added and confirmed that prison0's pr_uref is maintained correctly even when a jail hits the dying state multiple times. It essentially reverts the changes to the if (flags & PD_DEUREF) by
Re: bad sector in gmirror HDD
Jeremy Chadwick free...@jdc.parodius.com wrote:
> ... using dd to find the bad LBAs is the only choice he has.

Or sysutils/diskcheckd. It uses a 64KB block size, falling back to 512 -- to identify the bad LBA(s) -- after getting a failure when reading a large block, and IME it runs something like 10x faster than dd with bs=64k.

It would be advisable to check the syslog configuration before using diskcheckd, since that is how it reports, and there is reason to suspect that the as-shipped syslog.conf may discard at least some of diskcheckd's messages.
Re: bad sector in gmirror HDD
On Sun, Aug 21, 2011 at 02:00:33AM -0700, per...@pluto.rain.com wrote:
> Jeremy Chadwick free...@jdc.parodius.com wrote:
> > ... using dd to find the bad LBAs is the only choice he has.
>
> or sysutils/diskcheckd. It uses a 64KB blocksize, falling back to 512 --
> to identify the bad LBA(s) -- after getting a failure when reading a
> large block, and IME it runs something like 10x faster than dd with
> bs=64k. It would be advisable to check syslog configuration before using
> diskcheckd, since that is how it reports and there is reason to suspect
> that the as-shipped syslog.conf may discard at least some of
> diskcheckd's messages.

That software has a major problem where it runs constantly, rather than periodically. I know because I'm the one who opened the PR on it:

http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/115853

There's a discussion about this port/issue from a few days ago (how sweet!):

http://lists.freebsd.org/pipermail/freebsd-ports/2011-August/069276.html

With comments from you stating that the software is behaving as designed and that I misread the man page, but also stating point blank that either way the software runs continuously (which is what the PR was about in the first place):

http://lists.freebsd.org/pipermail/freebsd-ports/2011-August/069321.html

I closed the PR because when I left as a committer I no longer wanted to deal with the issue. I probably should have marked the PR as suspended, but either way it's an ordeal that needs to get dealt with; it absolutely should be re-opened in some way.

Then there's this PR, which I fully agree should have *nothing* to do with gmirror, so I'm not even sure how to interpret what's written. Furthermore, the author of this PR commented in PR 115853 stating something completely different (read the first few lines very carefully/slowly -- it seems to indicate he agrees with my PR, but then opened up a separate PR with different wording):

http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/143566

Back to my PR.
I state that I set up diskcheckd.conf using the option you describe as a length of time over which to spread each pass, yet what happened was that it did as much I/O as it could (read the entire disk in 45 minutes) and then proceeded to do it again (no sleep()). That is not the same thing as "do I/O over the course of 7 days."

Furthermore, the man page example gives this:

EXAMPLES
     To check all of /dev/ad0 for errors once every two weeks, use this
     entry in diskcheckd.conf:

           /dev/ad0    *    14    *

Which is no different than what I specified in my PR, other than that I used a value of 7 and the example uses 14. So what about the rest of the man page?

     The second format consists of four white space separated fields, which
     are the full pathname of the disk device, the size of that disk, the
     frequency in days at which to check that disk, and the rate in
     kilobytes per second at which to check this disk.  Naturally, it would
     be contradictory to specify both the frequency and the rate, so only
     one of these should be specified.  Additionally, the size of the disk
     should not be specified if the rate is specified, as this information
     is unnecessary.

I did not misread the man page, especially given what's in EXAMPLES. It's a bug somewhere -- either in the man page or the software itself. This software will burn through your drive constantly, unless you use the rate-in-kilobytes-per-second field. The frequency field doesn't work as advertised. And besides, such a utility really shouldn't be a daemon anyway, but a periodic(8)-called utility with appropriate locks put in place to ensure more than one instance can't be run at once.

-- 
| Jeremy Chadwick                               jdc at parodius.com |
| Parodius Networking                      http://www.parodius.com/ |
| UNIX Systems Administrator                Mountain View, CA, US   |
| Making life hard for others since 1977.          PGP 4BD6C0CB     |