Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On 6/25/11 12:39 AM, Andrey Chernov wrote: On Fri, Jun 24, 2011 at 09:09:08PM -0400, Justin T. Gibbs wrote: No problem. I just set kern.geom.debugflags=4 in loader.conf and here is new photo (with recent kernel, no patches): http://img803.imageshack.us/img803/4679/25062011006.jpg I skip all noisy parts related to ada0 and ada1 partitions probes. As you can see, only 3 cd0-related geom call issued, right before cd1 probe shown. Strange thing is that I see no single cd1-related geom call, but it may be because of hang. The GEOM processing is serialized, so that is not unexpected. What your logs are telling me is that the probe for CD0 is hanging. I don't know why. Could you just postpone GEOM calls after any probe will be completed? It seems GEOM goes here even before probe and waits for probe forever. What probe waits in the same time is unclear for me (ccb_scan), but CD devices are slow and may not survive such multisleeping, missing some responses in the middle. The problem is not GEOM. It's not the thread waiting in ccb_scan - that thread is designed to wait there until an asynchronous device arrival/departure event occurs which is not the case here. The problem is in or below CAM, and that problem is causing the probe to never complete. Are you positive it is this specific SVN revision that prevents cd0 from probing properly and not one of my previous CAM commits? I use splitting by half method to find exact date which boots, then see the next commit above that date. Pre-commit kernel goes to multiuser and network is alive. I don't test CDs are working, I'll do that later and report it. So you know that revisions 223081, 223084, 223085, and 223089 all boot just fine? I committed five revisions on that date. 223099 just happens to be the last one for that day. -- Justin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Sat, Jun 25, 2011 at 07:27:20AM -0400, Justin T. Gibbs wrote: I use splitting by half method to find exact date which boots, then see the next commit above that date. Pre-commit kernel goes to multiuser and network is alive. I don't test CDs are working, I'll do that later and report it. So you know that revisions 223081, 223084, 223085, and 223089 all boot just fine? I committed five revisions on that date. 223099 just happens to be the last one for that day. OMG, I am very sorry to misleading you to wrong direction, and your time waste, give me pointy hat, please! It was r223081 which not boots. I.e. kernel cvsupped at src-sys date=2011.06.14.12.00.00 boots normally but cvsupped at src-sys date=2011.06.14.14.54.00 i.e. in the same minute after r223081 committed Tue Jun 14 14:53:17 2011 UTC hangs. I don't know how I may miss commits you mention(( -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Thu, Jun 23, 2011 at 11:01:17PM -0400, Justin T. Gibbs wrote: To test this theory, apply the following patch. I do not know if this is safe for changer devices, but I will review the changer code if this patch fixes ache's problem. I don't have changers. One of the plain ATA DVDs is read-only (cd0, which is not detected now) and another one is read-write (cd1, detected). I try the patch but nothing is changed in the picture. cd0 still not detected, xpt_thrd sleep in ccb_scan and g_event sleep in caplck. Here is probe from the old kernel which works, just in case, nothing unusual with it: cd0 at ata2 bus 0 scbus2 target 0 lun 0 cd0: ASUS DVD-E616A 1.08 Removable CD-ROM SCSI-0 device cd0: 100.000MB/s transfers (UDMA5, ATAPI 12bytes, PIO 65534bytes) cd0: Attempt to query device size failed: NOT READY, Medium not present - tray closed Unplugging power from it allows to boot normally. Inserting media does not help too. --- //depot/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c2011-05-07 10:06:43.0 -0600 +++ /home/justing/perforce/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c 2011-05-07 10:06:43.0 -0600 @@ -687,6 +687,10 @@ else softc-minimum_command_size = 6; + /* + * Refcount and block open attempts until we are setup + * Can't block + */ (void)cam_periph_hold(periph, PRIBIO); cam_periph_unlock(periph); /* @@ -747,7 +751,6 @@ softc-disk-d_hba_subdevice = cpi.hba_subdevice; disk_create(softc-disk, DISK_VERSION); cam_periph_lock(periph); - cam_periph_unhold(periph); /* * Add an async callback so that we get @@ -972,12 +975,6 @@ cdregisterexit: - /* - * Refcount and block open attempts until we are setup - * Can't block - */ - (void)cam_periph_hold(periph, PRIBIO); - if ((softc-flags CD_FLAG_CHANGER) == 0) xpt_schedule(periph, CAM_PRIORITY_DEV); else ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
I forget to mention that the place is the same as trace shows: g_event_procbody-g_run_events-g_new_provider_event-g_dev_taste- g_dev_attrchanged-g_access-g_disk_access-cdopen-cam_periph_hold and it sleeps. On Fri, Jun 24, 2011 at 05:08:27PM +0400, Andrey Chernov wrote: On Thu, Jun 23, 2011 at 11:01:17PM -0400, Justin T. Gibbs wrote: To test this theory, apply the following patch. I do not know if this is safe for changer devices, but I will review the changer code if this patch fixes ache's problem. I don't have changers. One of the plain ATA DVDs is read-only (cd0, which is not detected now) and another one is read-write (cd1, detected). I try the patch but nothing is changed in the picture. cd0 still not detected, xpt_thrd sleep in ccb_scan and g_event sleep in caplck. Here is probe from the old kernel which works, just in case, nothing unusual with it: cd0 at ata2 bus 0 scbus2 target 0 lun 0 cd0: ASUS DVD-E616A 1.08 Removable CD-ROM SCSI-0 device cd0: 100.000MB/s transfers (UDMA5, ATAPI 12bytes, PIO 65534bytes) cd0: Attempt to query device size failed: NOT READY, Medium not present - tray closed Unplugging power from it allows to boot normally. Inserting media does not help too. --- //depot/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c 2011-05-07 10:06:43.0 -0600 +++ /home/justing/perforce/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c 2011-05-07 10:06:43.0 -0600 @@ -687,6 +687,10 @@ else softc-minimum_command_size = 6; + /* +* Refcount and block open attempts until we are setup +* Can't block +*/ (void)cam_periph_hold(periph, PRIBIO); cam_periph_unlock(periph); /* @@ -747,7 +751,6 @@ softc-disk-d_hba_subdevice = cpi.hba_subdevice; disk_create(softc-disk, DISK_VERSION); cam_periph_lock(periph); - cam_periph_unhold(periph); /* * Add an async callback so that we get @@ -972,12 +975,6 @@ cdregisterexit: - /* -* Refcount and block open attempts until we are setup -* Can't block -*/ - (void)cam_periph_hold(periph, PRIBIO); - if ((softc-flags CD_FLAG_CHANGER) == 0) xpt_schedule(periph, CAM_PRIORITY_DEV); else ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Jun 23, 2011, at 11:18 PM, Scott Long wrote: On Jun 23, 2011, at 9:01 PM, Justin T. Gibbs wrote: On 6/22/11 4:09 PM, Kenneth D. Merry wrote: On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote: On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote: These two are interesting: http://img825.imageshack.us/img825/1249/21062011014m.jpg http://img839.imageshack.us/img839/3791/21062011015.jpg It looks like the GEOM event thread is stuck inside the cd(4) driver. The cd(4) driver is trying to acquire the peripheral lock, and is sleeping until it gets it. What isn't clear is who is holding it. ... The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So that tells me that one of several things is going on: - There is a path in the cd(4) driver where it can call cam_periph_hold() but not cam_periph_unhold(). - There is another thread in the system that has called cam_periph_hold(), and has gotten stuck before it can call cam_periph_unhold(). - The hold/unhold logic is broken, and there is a case where a thread waiting for the lock can miss the wakeup. After looking at the code, I don't think this is the case, but I may have missed something. So it is probably one of the first two cases. ... I have a theory for the cause of this hang. The commit that triggers this problem added calls to g_access() during the geom_dev probe. I believe this hit a race in cdregister() where the periph hold lock is dropped around the changer probe code. Why the periph hold lock is dropped there, I do not know as I haven't fully reviewed the changer probe code. Are you talking about this? disk_create(softc-disk, DISK_VERSION); cam_periph_lock(periph); cam_periph_unhold(periph); [...] if (((cgd-ccb_h.target_lun 0) ((softc-quirks CD_Q_NO_CHANGER) == 0)) || ((softc-quirks CD_Q_CHANGER) != 0)) { The unhold there compliments the hold that was done prior to disk_create(). The hold/unhold is done as a hack around the need to drop the periph/sim mutex while calling disk_create(), due to the later's insistence on using blocking calls. I've wanted to re-think how that pattern is done (it's the same gross hack in nearly all of the periph drivers), but haven't gotten around to it. If the 'hold' semaphore needs to be held longer to prevent the race that you're theorizing, then it should be possible to simply extend its coverage in the code block, but I'm not sure if it'll result in an unintended deadlock with the changer enumeration/matching code. I _think_ that it'll be ok, but the density of magic in the code is a bit overwhelming at this time of night =-) Actually, what's probably happening is that the sim/periph lock is being dropped but the hold semaphore is held, disk_create() is called, which kicks off GEOM to do GEOM-ish things including the new g_access() call. It tries to call cdopen(), which grabs the periph/sim mutex in order to then get the hold semaphore, and winds up sleeping because the semaphore is already held in cdregister(). If/when disk_create() returns, cdregister() winds up waiting for the periph lock because it's held by cdopen(), and now you have a deadlock between the two. Again, this is my fault for being lazy and not re-organzing the periph drivers so that disk_create() is safely called without shady locking hacks. I don't think that extending the coverage of the hold semaphore is going to help in this case. The periphs need to adopt a new pattern where disk_create() isn't called until the periph is completely initialized and ready for periph_open() to be called without any locking gymnastics. Scott ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On 6/24/11 3:35 PM, Scott Long wrote: On Jun 23, 2011, at 11:18 PM, Scott Long wrote: On Jun 23, 2011, at 9:01 PM, Justin T. Gibbs wrote: On 6/22/11 4:09 PM, Kenneth D. Merry wrote: On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote: On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote: These two are interesting: http://img825.imageshack.us/img825/1249/21062011014m.jpg http://img839.imageshack.us/img839/3791/21062011015.jpg It looks like the GEOM event thread is stuck inside the cd(4) driver. The cd(4) driver is trying to acquire the peripheral lock, and is sleeping until it gets it. What isn't clear is who is holding it. ... The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So that tells me that one of several things is going on: - There is a path in the cd(4) driver where it can call cam_periph_hold() but not cam_periph_unhold(). - There is another thread in the system that has called cam_periph_hold(), and has gotten stuck before it can call cam_periph_unhold(). - The hold/unhold logic is broken, and there is a case where a thread waiting for the lock can miss the wakeup. After looking at the code, I don't think this is the case, but I may have missed something. So it is probably one of the first two cases. ... I have a theory for the cause of this hang. The commit that triggers this problem added calls to g_access() during the geom_dev probe. I believe this hit a race in cdregister() where the periph hold lock is dropped around the changer probe code. Why the periph hold lock is dropped there, I do not know as I haven't fully reviewed the changer probe code. Are you talking about this? disk_create(softc-disk, DISK_VERSION); cam_periph_lock(periph); cam_periph_unhold(periph); [...] if (((cgd-ccb_h.target_lun 0) ((softc-quirks CD_Q_NO_CHANGER) == 0)) || ((softc-quirks CD_Q_CHANGER) != 0)) { The unhold there compliments the hold that was done prior to disk_create(). The hold/unhold is done as a hack around the need to drop the periph/sim mutex while calling disk_create(), due to the later's insistence on using blocking calls. I've wanted to re-think how that pattern is done (it's the same gross hack in nearly all of the periph drivers), but haven't gotten around to it. If the 'hold' semaphore needs to be held longer to prevent the race that you're theorizing, then it should be possible to simply extend its coverage in the code block, but I'm not sure if it'll result in an unintended deadlock with the changer enumeration/matching code. I _think_ that it'll be ok, but the density of magic in the code is a bit overwhelming at this time of night =-) The cd driver is the only one that temporarily drops the periph hold semaphore before scheduling and completing the probe processing. I think the above race could happen but that there is something else causing the current hang ache is experiencing. Actually, what's probably happening is that the sim/periph lock is being dropped but the hold semaphore is held, disk_create() is called, which kicks off GEOM to do GEOM-ish things including the new g_access() call. It tries to call cdopen(), which grabs the periph/sim mutex in order to then get the hold semaphore, and winds up sleeping because the semaphore is already held in cdregister(). If/when disk_create() returns, cdregister() winds up waiting for the periph lock because it's held by cdopen(), and now you have a deadlock between the two. The sim mutex is dropped while waiting for the periph hold semaphore, so I don't understand the deadlock you are describing. If there is still a traditional deadlock here, I'd expect to find a thread somewhere blocking on a mutex required to release the periph lock semaphore. Ache hasn't reported a substantive change to the ddb output, just that instead of g_eli blocked in open, it is now g_dev. Instead, I believe that either one of the GEOM taste methods is leaking an access reference (so cdclose() is not called), or the CD driver is failing to release the hold semaphore during probing. Setting kern.geom.debugflags to '4' will trace the access calls and allow the GEOM side to be ruled out. If GEOM is exonerated, we can add tracing to cam_perihp_(un)hold to track this down further. -- Justin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Fri, Jun 24, 2011 at 04:20:24PM -0400, Justin T. Gibbs wrote: Instead, I believe that either one of the GEOM taste methods is leaking an access reference (so cdclose() is not called), or the CD driver is failing to release the hold semaphore during probing. Setting kern.geom.debugflags to '4' will trace the access calls and allow the GEOM side to be ruled out. If GEOM is exonerated, we can add tracing to cam_perihp_(un)hold to track this down further. No problem. I just set kern.geom.debugflags=4 in loader.conf and here is new photo (with recent kernel, no patches): http://img803.imageshack.us/img803/4679/25062011006.jpg I skip all noisy parts related to ada0 and ada1 partitions probes. As you can see, only 3 cd0-related geom call issued, right before cd1 probe shown. Strange thing is that I see no single cd1-related geom call, but it may be because of hang. -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On 6/24/11 6:26 PM, Andrey Chernov wrote: On Fri, Jun 24, 2011 at 04:20:24PM -0400, Justin T. Gibbs wrote: Instead, I believe that either one of the GEOM taste methods is leaking an access reference (so cdclose() is not called), or the CD driver is failing to release the hold semaphore during probing. Setting kern.geom.debugflags to '4' will trace the access calls and allow the GEOM side to be ruled out. If GEOM is exonerated, we can add tracing to cam_perihp_(un)hold to track this down further. No problem. I just set kern.geom.debugflags=4 in loader.conf and here is new photo (with recent kernel, no patches): http://img803.imageshack.us/img803/4679/25062011006.jpg I skip all noisy parts related to ada0 and ada1 partitions probes. As you can see, only 3 cd0-related geom call issued, right before cd1 probe shown. Strange thing is that I see no single cd1-related geom call, but it may be because of hang. The GEOM processing is serialized, so that is not unexpected. What your logs are telling me is that the probe for CD0 is hanging. I don't know why. Are you positive it is this specific SVN revision that prevents cd0 from probing properly and not one of my previous CAM commits? Just getting to multi-user doesn't mean we're ok here. My GEOM changes may make the system hang earlier, but you'll need to test access to cd0 even if you get to multi-user mode to be sure that the device is functioning correctly. I just want to be positive that we're barking up the right tree. -- Justin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Fri, Jun 24, 2011 at 09:09:08PM -0400, Justin T. Gibbs wrote: No problem. I just set kern.geom.debugflags=4 in loader.conf and here is new photo (with recent kernel, no patches): http://img803.imageshack.us/img803/4679/25062011006.jpg I skip all noisy parts related to ada0 and ada1 partitions probes. As you can see, only 3 cd0-related geom call issued, right before cd1 probe shown. Strange thing is that I see no single cd1-related geom call, but it may be because of hang. The GEOM processing is serialized, so that is not unexpected. What your logs are telling me is that the probe for CD0 is hanging. I don't know why. Could you just postpone GEOM calls after any probe will be completed? It seems GEOM goes here even before probe and waits for probe forever. What probe waits in the same time is unclear for me (ccb_scan), but CD devices are slow and may not survive such multisleeping, missing some responses in the middle. Are you positive it is this specific SVN revision that prevents cd0 from probing properly and not one of my previous CAM commits? Just getting to multi-user doesn't mean we're ok here. My GEOM changes may make the system hang earlier, but you'll need to test access to cd0 even if you get to multi-user mode to be sure that the device is functioning correctly. I just want to be positive that we're barking up the right tree. I use splitting by half method to find exact date which boots, then see the next commit above that date. Pre-commit kernel goes to multiuser and network is alive. I don't test CDs are working, I'll do that later and report it. -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Sat, Jun 25, 2011 at 08:39:17AM +0400, Andrey Chernov wrote: Are you positive it is this specific SVN revision that prevents cd0 from probing properly and not one of my previous CAM commits? Just getting to multi-user doesn't mean we're ok here. My GEOM changes may make the system hang earlier, but you'll need to test access to cd0 even if you get to multi-user mode to be sure that the device is functioning correctly. I just want to be positive that we're barking up the right tree. I use splitting by half method to find exact date which boots, then see the next commit above that date. Pre-commit kernel goes to multiuser and network is alive. I don't test CDs are working, I'll do that later and report it. Just test it, kernel dated src-sys date=2011.06.14.12.00.00 Both DVDs are mounted normally and files are readable on both. Here is g_* debugging and cd* related parts from dmesg of working kernel (with few lines before/after to hint the place). As you can see, cd0-probe appears far later in dmesg, but cd0-geom (not much later cd1-geom) is earlier thing: ... ada0 at ahcich2 bus 0 scbus3 target 0 lun 0 ada0: SAMSUNG HD502HJ 1AJ10001 ATA-8 SATA 2.x device ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) ada0: Command Queueing enabled g_access(0xc652fe40(cd0), 1, 0, 0) open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4500(cd0) g_disk_access(cd0, 1, 0, 0) ada0: 476940MB (976773168 512 byte sectors: 16H 63S/T 16383C) ada0: Previously was known as ad10 ada1 at ahcich3 bus 0 scbus4 target 0 lun 0 ada1: ST31500341AS CC1H ATA-8 SATA 2.x device ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) ada1: Command Queueing enabled ada1: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C) ada1: Previously was known as ad12 ... Timecounter TSC-low frequency 14077952 Hz quality 1000 WARNING: WITNESS option enabled, expect reduced performance. cd1 at ata2 bus 0 scbus2 target 1 lun 0 cd1: PLEXTOR DVDR PX-740A 1.02 Removable CD-ROM SCSI-0 device cd1: 33.300MB/s transfers (UDMA2, ATAPI 12bytes, PIO 65534bytes) cd1: Attempt to query device size failed: NOT READY, Medium not present uhub0: 2 ports with 2 removable, self powered ... uhub7: 6 ports with 6 removable, self powered cd0 at ata2 bus 0 scbus2 target 0 lun 0 cd0: ASUS DVD-E616A 1.08 Removable CD-ROM SCSI-0 device cd0: 100.000MB/s transfers (UDMA5, ATAPI 12bytes, PIO 65534bytes) cd0: Attempt to query device size failed: NOT READY, Medium not present - tray closed g_access(0xc652fe40(cd0), -1, 0, 0) open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4500(cd0) g_disk_access(cd0, -1, 0, 0) g_access(0xc652fe00(cd0), 1, 0, 0) open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4500(cd0) g_disk_access(cd0, 1, 0, 0) g_access(0xc652fe00(cd0), -1, 0, 0) open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4500(cd0) g_disk_access(cd0, -1, 0, 0) g_access(0xc652fdc0(cd0), 1, 0, 0) open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4500(cd0) g_disk_access(cd0, 1, 0, 0) g_access(0xc652fdc0(cd0), -1, 0, 0) open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4500(cd0) g_disk_access(cd0, -1, 0, 0) g_access(0xc652fd40(cd1), 1, 0, 0) open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4280(cd1) g_disk_access(cd1, 1, 0, 0) g_access(0xc652fd40(cd1), -1, 0, 0) open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4280(cd1) g_disk_access(cd1, -1, 0, 0) g_access(0xc652fd00(cd1), 1, 0, 0) open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4280(cd1) g_disk_access(cd1, 1, 0, 0) g_access(0xc652fd00(cd1), -1, 0, 0) open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4280(cd1) g_disk_access(cd1, -1, 0, 0) g_access(0xc652fcc0(cd1), 1, 0, 0) open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4280(cd1) g_disk_access(cd1, 1, 0, 0) g_access(0xc652fcc0(cd1), -1, 0, 0) open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4280(cd1) g_disk_access(cd1, -1, 0, 0) g_access(0xc6b1c040(ada0), 1, 0, 0) ... -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
on 22/06/2011 23:09 Kenneth D. Merry said the following: The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So that tells me that one of several things is going on: - There is a path in the cd(4) driver where it can call cam_periph_hold() but not cam_periph_unhold(). - There is another thread in the system that has called cam_periph_hold(), and has gotten stuck before it can call cam_periph_unhold(). - The hold/unhold logic is broken, and there is a case where a thread waiting for the lock can miss the wakeup. After looking at the code, I don't think this is the case, but I may have missed something. So it is probably one of the first two cases. From the dmesg, I only see cd1 listed, not cd0. So it is possible that cd0 is stuck in the probe code somewhere, and the geom code just gets stuck trying to open it when the probe hasn't completed. Seeing the stack trace for the taskq thread that is running on CPU 0 (process 100014) might be enlightening, it's hard to say. That may or may not show the issue. It's possible that this issue is directly related to the commit in question; perhaps there is an error being returned that wasn't returned before and it isn't being handled right in the cd(4) driver. (The cd(4) driver wasn't touched in the commit.) It's also possible that the commit in question just changed the timing and your system is hitting a race that was there previously. I have a suspicion that this is actually the case. More than once I've seen under qemu that the kernel boot non-deterministically gets stuck in the cd driver. Other people have also bumped into this. E.g., here's one of the reports that I googled up, it's not exactly the same as what ache has reported, but somewhat similar: http://lists.freebsd.org/pipermail/freebsd-current/2010-October/020336.html -- Andriy Gapon ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
Apparently there is another problem plain ATA CD/DVD related. With r223443 hangs nature is changed: I see no more waiting in caplck state, just xpt_thrd waiting in ccb_scan state forever and those repeated messages: run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config run_interrupt_driven_hooks: still waiting after 120 seconds for xpt_config ... and so on. On Wed, Jun 22, 2011 at 02:09:19PM -0600, Kenneth D. Merry wrote: On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote: On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote: These two are interesting: http://img825.imageshack.us/img825/1249/21062011014m.jpg http://img839.imageshack.us/img839/3791/21062011015.jpg It looks like the GEOM event thread is stuck inside the cd(4) driver. The cd(4) driver is trying to acquire the peripheral lock, and is sleeping until it gets it. What isn't clear is who is holding it. The ps output shows an idle thread running on CPU 1, and thread 100014 (taskq) running on CPU 0. Unfortunately I don't see a stack trace for that. (I might have missed it.) Do you happen to have the image with the stack trace for that thread? I don't have the image because no disks are mounted at that stage and the swap slice is not attached. But I can issue more specific DDB commands to narrow it down, just say what you need in detail. BTW, the machine have 2 DVD both are attached to Marvell IDE plain ATA interface, they always works before. Are you sure that something holding the lock? 'show lock' shows absolutely nothing, it is empty. Well, after looking at the code a little more, it looks like the lock that is being held is the periph lock, which is really just a flag. So 'show lock' wouldn't show anything relevant. Here's cam_periph_hold(): int cam_periph_hold(struct cam_periph *periph, int priority) { int error; /* * Increment the reference count on the peripheral * while we wait for our lock attempt to succeed * to ensure the peripheral doesn't disappear out * from user us while we sleep. */ if (cam_periph_acquire(periph) != CAM_REQ_CMP) return (ENXIO); mtx_assert(periph-sim-mtx, MA_OWNED); while ((periph-flags CAM_PERIPH_LOCKED) != 0) { periph-flags |= CAM_PERIPH_LOCK_WANTED; if ((error = mtx_sleep(periph, periph-sim-mtx, priority, caplck, 0)) != 0) { cam_periph_release_locked(periph); return (error); } } periph-flags |= CAM_PERIPH_LOCKED; return (0); } The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So that tells me that one of several things is going on: - There is a path in the cd(4) driver where it can call cam_periph_hold() but not cam_periph_unhold(). - There is another thread in the system that has called cam_periph_hold(), and has gotten stuck before it can call cam_periph_unhold(). - The hold/unhold logic is broken, and there is a case where a thread waiting for the lock can miss the wakeup. After looking at the code, I don't think this is the case, but I may have missed something. So it is probably one of the first two cases. From the dmesg, I only see cd1 listed, not cd0. So it is possible that cd0 is stuck in the probe code somewhere, and the geom code just gets stuck trying to open it when the probe hasn't completed. Seeing the stack trace for the taskq thread that is running on CPU 0 (process 100014) might be enlightening, it's hard to say. That may or may not show the issue. It's possible that this issue is directly related to the commit in question; perhaps there is an error being returned that wasn't returned before and it isn't being handled right in the cd(4) driver. (The cd(4) driver wasn't touched in the commit.) It's also possible that the commit in question just changed the timing and your system is hitting a race that was there previously. Ken -- Kenneth Merry k...@freebsd.org -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Thu, Jun 23, 2011 at 03:51:36PM +0300, Andriy Gapon wrote: More than once I've seen under qemu that the kernel boot non-deterministically gets stuck in the cd driver. Other people have also bumped into this. E.g., here's one of the reports that I googled up, it's not exactly the same as what ache has reported, but somewhat similar: http://lists.freebsd.org/pipermail/freebsd-current/2010-October/020336.html Thread named as xpt_thrd might be of particular interest. See my reply to ken (after applying recent r223443), caplck seems gone, but xpt_thrd waiting forever in ccb_scan with repeated messages: run_interrupt_driven_hooks: still waiting after XXX seconds for xpt_config -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Wed, Jun 22, 2011 at 02:09:19PM -0600, Kenneth D. Merry wrote: Well, after looking at the code a little more, it looks like the lock that is being held is the periph lock, which is really just a flag. So 'show lock' wouldn't show anything relevant. Here's cam_periph_hold(): With recent r223475 situation returned to what you describe (and still no cd0 in probe). As tracing 18 show, it was cdopen() who calls cam_periph_hold() and sleeps there forever. Tracing 100014 as you suggest shows almost nothing: only fork_trampoline() is there. -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On 6/22/11 4:09 PM, Kenneth D. Merry wrote: On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote: On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote: These two are interesting: http://img825.imageshack.us/img825/1249/21062011014m.jpg http://img839.imageshack.us/img839/3791/21062011015.jpg It looks like the GEOM event thread is stuck inside the cd(4) driver. The cd(4) driver is trying to acquire the peripheral lock, and is sleeping until it gets it. What isn't clear is who is holding it. ... The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So that tells me that one of several things is going on: - There is a path in the cd(4) driver where it can call cam_periph_hold() but not cam_periph_unhold(). - There is another thread in the system that has called cam_periph_hold(), and has gotten stuck before it can call cam_periph_unhold(). - The hold/unhold logic is broken, and there is a case where a thread waiting for the lock can miss the wakeup. After looking at the code, I don't think this is the case, but I may have missed something. So it is probably one of the first two cases. ... I have a theory for the cause of this hang. The commit that triggers this problem added calls to g_access() during the geom_dev probe. I believe this hit a race in cdregister() where the periph hold lock is dropped around the changer probe code. Why the periph hold lock is dropped there, I do not know as I haven't fully reviewed the changer probe code. The drop of the lock in cdregister() can allow geom classes to probe and thus call g_access()-g_disk_access()-cdopen() before a probe is initiated in the normal way by cdregister(). cdopen() checks for media presence by issuing immediate ccds. When the race is exploited, the peripheral will be in the probe state when the immediate ccbs are requested. This will cause the device probe to be performed before the immediate ccd is returned. When the cdopen() activity finally unwinds, cdregister() will again take the periph hold lock and schedule the peripheral, expecting probe processing to complete and release the hold lock. However, since the periph is already in the normal state (due to the successful probe performed indirectly by the cdopen() call), that unlock never happens, thus wedging the device. To test this theory, apply the following patch. I do not know if this is safe for changer devices, but I will review the changer code if this patch fixes ache's problem. -- Justin --- //depot/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c 2011-05-07 10:06:43.0 -0600 +++ /home/justing/perforce/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c 2011-05-07 10:06:43.0 -0600 @@ -687,6 +687,10 @@ else softc-minimum_command_size = 6; + /* +* Refcount and block open attempts until we are setup +* Can't block +*/ (void)cam_periph_hold(periph, PRIBIO); cam_periph_unlock(periph); /* @@ -747,7 +751,6 @@ softc-disk-d_hba_subdevice = cpi.hba_subdevice; disk_create(softc-disk, DISK_VERSION); cam_periph_lock(periph); - cam_periph_unhold(periph); /* * Add an async callback so that we get @@ -972,12 +975,6 @@ cdregisterexit: - /* -* Refcount and block open attempts until we are setup -* Can't block -*/ - (void)cam_periph_hold(periph, PRIBIO); - if ((softc-flags CD_FLAG_CHANGER) == 0) xpt_schedule(periph, CAM_PRIORITY_DEV); else ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Jun 23, 2011, at 9:01 PM, Justin T. Gibbs wrote: On 6/22/11 4:09 PM, Kenneth D. Merry wrote: On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote: On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote: These two are interesting: http://img825.imageshack.us/img825/1249/21062011014m.jpg http://img839.imageshack.us/img839/3791/21062011015.jpg It looks like the GEOM event thread is stuck inside the cd(4) driver. The cd(4) driver is trying to acquire the peripheral lock, and is sleeping until it gets it. What isn't clear is who is holding it. ... The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So that tells me that one of several things is going on: - There is a path in the cd(4) driver where it can call cam_periph_hold() but not cam_periph_unhold(). - There is another thread in the system that has called cam_periph_hold(), and has gotten stuck before it can call cam_periph_unhold(). - The hold/unhold logic is broken, and there is a case where a thread waiting for the lock can miss the wakeup. After looking at the code, I don't think this is the case, but I may have missed something. So it is probably one of the first two cases. ... I have a theory for the cause of this hang. The commit that triggers this problem added calls to g_access() during the geom_dev probe. I believe this hit a race in cdregister() where the periph hold lock is dropped around the changer probe code. Why the periph hold lock is dropped there, I do not know as I haven't fully reviewed the changer probe code. Are you talking about this? disk_create(softc-disk, DISK_VERSION); cam_periph_lock(periph); cam_periph_unhold(periph); [...] if (((cgd-ccb_h.target_lun 0) ((softc-quirks CD_Q_NO_CHANGER) == 0)) || ((softc-quirks CD_Q_CHANGER) != 0)) { The unhold there compliments the hold that was done prior to disk_create(). The hold/unhold is done as a hack around the need to drop the periph/sim mutex while calling disk_create(), due to the later's insistence on using blocking calls. I've wanted to re-think how that pattern is done (it's the same gross hack in nearly all of the periph drivers), but haven't gotten around to it. If the 'hold' semaphore needs to be held longer to prevent the race that you're theorizing, then it should be possible to simply extend its coverage in the code block, but I'm not sure if it'll result in an unintended deadlock with the changer enumeration/matching code. I _think_ that it'll be ok, but the density of magic in the code is a bit overwhelming at this time of night =-) Scott ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Tue, Jun 21, 2011 at 9:51 PM, Andrey Chernov a...@freebsd.org wrote: Remove cd/acd from your kernel config to see if that allows you to boot? I unplug DVDs physically and kernel finally boots! BTW both DVDs was empty during the hanged boot and works normally under Win7. Put a DVD in each of the drives, like so: Case 1: - Drive one populated. - Drive two empty. Case 2: - Drive one empty. - Drive two populated. Case 3: - Drive one populated. - Drive two populated. See what happens. I've run into issues with FreeBSD in the past with some motherboards / drives / BIOSes where things were hung up unless I put a DVD in my drive (or vice versa) because of cd(4)/cam(4). Thanks, -Garrett ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Tue, Jun 21, 2011 at 11:19:38PM -0700, Garrett Cooper wrote: On Tue, Jun 21, 2011 at 9:51 PM, Andrey Chernov a...@freebsd.org wrote: Remove cd/acd from your kernel config to see if that allows you to boot? I unplug DVDs physically and kernel finally boots! BTW both DVDs was empty during the hanged boot and works normally under Win7. Put a DVD in each of the drives, like so: Case 1: - Drive one populated. - Drive two empty. Case 2: - Drive one empty. - Drive two populated. Case 3: - Drive one populated. - Drive two populated. See what happens. I've run into issues with FreeBSD in the past with some motherboards / drives / BIOSes where things were hung up unless I put a DVD in my drive (or vice versa) because of cd(4)/cam(4). I test both DVD inside case and it hangs. Moreover, with both DVD inside DVD devices are even not detected at all as previously (the machine hangs even earlier). And I have messages I don't see before: run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config run_interrupt_driven_hooks: still waiting after 120 seconds for xpt_config ... and so on. -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
I forget to add that only 1 disk inside each drive doesn't change this picture at all. On Wed, Jun 22, 2011 at 10:42:30AM +0400, Andrey Chernov wrote: On Tue, Jun 21, 2011 at 11:19:38PM -0700, Garrett Cooper wrote: On Tue, Jun 21, 2011 at 9:51 PM, Andrey Chernov a...@freebsd.org wrote: Remove cd/acd from your kernel config to see if that allows you to boot? I unplug DVDs physically and kernel finally boots! BTW both DVDs was empty during the hanged boot and works normally under Win7. Put a DVD in each of the drives, like so: Case 1: - Drive one populated. - Drive two empty. Case 2: - Drive one empty. - Drive two populated. Case 3: - Drive one populated. - Drive two populated. See what happens. I've run into issues with FreeBSD in the past with some motherboards / drives / BIOSes where things were hung up unless I put a DVD in my drive (or vice versa) because of cd(4)/cam(4). I test both DVD inside case and it hangs. Moreover, with both DVD inside DVD devices are even not detected at all as previously (the machine hangs even earlier). And I have messages I don't see before: run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config run_interrupt_driven_hooks: still waiting after 120 seconds for xpt_config ... and so on. -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote: On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote: These two are interesting: http://img825.imageshack.us/img825/1249/21062011014m.jpg http://img839.imageshack.us/img839/3791/21062011015.jpg It looks like the GEOM event thread is stuck inside the cd(4) driver. The cd(4) driver is trying to acquire the peripheral lock, and is sleeping until it gets it. What isn't clear is who is holding it. The ps output shows an idle thread running on CPU 1, and thread 100014 (taskq) running on CPU 0. Unfortunately I don't see a stack trace for that. (I might have missed it.) Do you happen to have the image with the stack trace for that thread? I don't have the image because no disks are mounted at that stage and the swap slice is not attached. But I can issue more specific DDB commands to narrow it down, just say what you need in detail. BTW, the machine have 2 DVD both are attached to Marvell IDE plain ATA interface, they always works before. Are you sure that something holding the lock? 'show lock' shows absolutely nothing, it is empty. Well, after looking at the code a little more, it looks like the lock that is being held is the periph lock, which is really just a flag. So 'show lock' wouldn't show anything relevant. Here's cam_periph_hold(): int cam_periph_hold(struct cam_periph *periph, int priority) { int error; /* * Increment the reference count on the peripheral * while we wait for our lock attempt to succeed * to ensure the peripheral doesn't disappear out * from user us while we sleep. */ if (cam_periph_acquire(periph) != CAM_REQ_CMP) return (ENXIO); mtx_assert(periph-sim-mtx, MA_OWNED); while ((periph-flags CAM_PERIPH_LOCKED) != 0) { periph-flags |= CAM_PERIPH_LOCK_WANTED; if ((error = mtx_sleep(periph, periph-sim-mtx, priority, caplck, 0)) != 0) { cam_periph_release_locked(periph); return (error); } } periph-flags |= CAM_PERIPH_LOCKED; return (0); } The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So that tells me that one of several things is going on: - There is a path in the cd(4) driver where it can call cam_periph_hold() but not cam_periph_unhold(). - There is another thread in the system that has called cam_periph_hold(), and has gotten stuck before it can call cam_periph_unhold(). - The hold/unhold logic is broken, and there is a case where a thread waiting for the lock can miss the wakeup. After looking at the code, I don't think this is the case, but I may have missed something. So it is probably one of the first two cases. From the dmesg, I only see cd1 listed, not cd0. So it is possible that cd0 is stuck in the probe code somewhere, and the geom code just gets stuck trying to open it when the probe hasn't completed. Seeing the stack trace for the taskq thread that is running on CPU 0 (process 100014) might be enlightening, it's hard to say. That may or may not show the issue. It's possible that this issue is directly related to the commit in question; perhaps there is an error being returned that wasn't returned before and it isn't being handled right in the cd(4) driver. (The cd(4) driver wasn't touched in the commit.) It's also possible that the commit in question just changed the timing and your system is hitting a race that was there previously. Ken -- Kenneth Merry k...@freebsd.org ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Wed, Jun 22, 2011 at 00:49:34 +0400, Andrey Chernov wrote: On Tue, Jun 21, 2011 at 10:17:19AM -0600, Kenneth D. Merry wrote: ps alltrace show locks show msgbuf Hopefully that will give us something to start looking at... This would really work a lot better if there is any way to get a serial console on the machine. The above will produce a good bit of output, and would likely need a lot of pictures. Since we can't reproduce the problem here, some debugging help would be greatly appreciated. Sorry I have no serial console. Here are the photos. I remove very similar looking USB parts from 'ps' and 'alltrace', and very general parts from 'alltrace' always been there. I hope remaining info will be enough. USB hotplagging works at this stage, so no reason to look there. If it will be not enough, I'll upload whole series. Thanks for uploading all of the photos. That's a lot of work, but they are helpful... I think I see part of the problem, but not the whole problem: 'show lock' outputs nothing, it means no locks just sleep somewhere forever. 'ps': http://img43.imageshack.us/img43/1424/21062011001j.jpg http://img835.imageshack.us/img835/6607/21062011002.jpg http://img841.imageshack.us/img841/5401/21062011003.jpg 'alltrace': http://img864.imageshack.us/img864/6757/21062011004ya.jpg http://img542.imageshack.us/img542/4857/21062011005.jpg http://img828.imageshack.us/img828/823/21062011006.jpg http://img5.imageshack.us/img5/910/21062011007.jpg http://img7.imageshack.us/img7/4704/21062011008.jpg http://img848.imageshack.us/img848/5487/21062011009.jpg http://img641.imageshack.us/img641/2/21062011010.jpg http://img7.imageshack.us/img7/7946/21062011011.jpg http://img860.imageshack.us/img860/8185/21062011012.jpg http://img696.imageshack.us/img696/5276/21062011013.jpg These two are interesting: http://img825.imageshack.us/img825/1249/21062011014m.jpg http://img839.imageshack.us/img839/3791/21062011015.jpg It looks like the GEOM event thread is stuck inside the cd(4) driver. The cd(4) driver is trying to acquire the peripheral lock, and is sleeping until it gets it. What isn't clear is who is holding it. The ps output shows an idle thread running on CPU 1, and thread 100014 (taskq) running on CPU 0. Unfortunately I don't see a stack trace for that. (I might have missed it.) Do you happen to have the image with the stack trace for that thread? http://img594.imageshack.us/img594/1773/21062011016.jpg http://img109.imageshack.us/img109/9937/21062011017.jpg http://img51.imageshack.us/img51/6047/21062011018l.jpg 'show msgbuf': http://img59.imageshack.us/img59/46/21062011019.jpg http://img189.imageshack.us/img189/483/21062011020.jpg http://img19.imageshack.us/img19/8163/21062011021.jpg http://img683.imageshack.us/img683/3171/21062011022.jpg http://img819.imageshack.us/img819/5923/21062011023.jpg http://img692.imageshack.us/img692/3789/21062011024.jpg http://img580.imageshack.us/img580/1550/21062011025.jpg http://img560.imageshack.us/img560/7478/21062011026.jpg http://img94.imageshack.us/img94/9371/21062011027.jpg http://img857.imageshack.us/img857/5185/21062011028.jpg Thanks, Ken -- Kenneth Merry k...@freebsd.org ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote: These two are interesting: http://img825.imageshack.us/img825/1249/21062011014m.jpg http://img839.imageshack.us/img839/3791/21062011015.jpg It looks like the GEOM event thread is stuck inside the cd(4) driver. The cd(4) driver is trying to acquire the peripheral lock, and is sleeping until it gets it. What isn't clear is who is holding it. The ps output shows an idle thread running on CPU 1, and thread 100014 (taskq) running on CPU 0. Unfortunately I don't see a stack trace for that. (I might have missed it.) Do you happen to have the image with the stack trace for that thread? I don't have the image because no disks are mounted at that stage and the swap slice is not attached. But I can issue more specific DDB commands to narrow it down, just say what you need in detail. BTW, the machine have 2 DVD both are attached to Marvell IDE plain ATA interface, they always works before. Are you sure that something holding the lock? 'show lock' shows absolutely nothing, it is empty. -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
Remove cd/acd from your kernel config to see if that allows you to boot? I unplug DVDs physically and kernel finally boots! BTW both DVDs was empty during the hanged boot and works normally under Win7. -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Mon, Jun 20, 2011 at 15:46:56 +0400, Andrey Chernov wrote: On Mon, Jun 20, 2011 at 11:01:46AM +0300, Kostik Belousov wrote: On Mon, Jun 20, 2011 at 11:02:22AM +0400, Andrey Chernov wrote: On Sun, Jun 19, 2011 at 08:15:43PM -0600, Justin T. Gibbs wrote: On 6/19/11 6:19 PM, Andrey Chernov wrote: Exactly that commit is responsible for boot hang. Please fix. BTW, I have MBR on SATA disk (CAM emulated), ICH9. Since it works for me, you'll need to provide more information. Can you at least drop into kdb to determine the likely source of the hang by getting a stack trace of all processes to see where they are sleeping and dumping lock information? I drop into DDB and put 'bt' console photo in the very first message of this thread - nothing unusual seen in the main stack. Could you please specify exact DDB commands you want to be issued by me? No dump can be provided since nothing is mounted yet including swap, BTW, I remember I saw previously unseen warnings with post Jun 14 kernels: xpt_action_default: CCB type 0xe not supported 'ps' inside DDB shows [xpt_thrd] at ccb_scan wmesg state and [g_event] at caplck wmesg state, [kernel] at g_waitid state. Even don't know, if it matters. Just in case, please try r223277. As the second message in the thread states, I try first even 223296 with the same hang and the same xpt_action_default: CCB type 0xe not supported As I think, DDB's 'ps' indicates that kernel waits something from geom and geom waits something from ccb_scan forever, just raw guess. I will be glad to issue more specific DDB commands and upload corresponding photos. BTW, pluging and unplugging USB devides works in that stage. Can you do the following when the hang happens: ps alltrace show locks show msgbuf Hopefully that will give us something to start looking at... This would really work a lot better if there is any way to get a serial console on the machine. The above will produce a good bit of output, and would likely need a lot of pictures. Since we can't reproduce the problem here, some debugging help would be greatly appreciated. Thanks, Ken -- Kenneth Merry k...@freebsd.org ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
Hi Andrey, On Mon, Jun 20, 2011 at 5:46 AM, Andrey Chernov a...@freebsd.org wrote: As the second message in the thread states, I try first even 223296 with the same hang and the same xpt_action_default: CCB type 0xe not supported As I think, DDB's 'ps' indicates that kernel waits something from geom and geom waits something from ccb_scan forever, just raw guess. I will be glad to issue more specific DDB commands and upload corresponding photos. BTW, pluging and unplugging USB devides works in that stage. Please try this patch: http://people.freebsd.org/~will/patches/ata_xpt_add_advinfo.0.diff Thanks. --Will. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
After applying the patch the system does not boot anymore! It hangs after probing for scsi devices. On Tue, Jun 21, 2011 at 8:58 PM, Will Andrews w...@firepipe.net wrote: Hi Andrey, On Mon, Jun 20, 2011 at 5:46 AM, Andrey Chernov a...@freebsd.org wrote: As the second message in the thread states, I try first even 223296 with the same hang and the same xpt_action_default: CCB type 0xe not supported As I think, DDB's 'ps' indicates that kernel waits something from geom and geom waits something from ccb_scan forever, just raw guess. I will be glad to issue more specific DDB commands and upload corresponding photos. BTW, pluging and unplugging USB devides works in that stage. Please try this patch: http://people.freebsd.org/~will/patches/ata_xpt_add_advinfo.0.diff Thanks. --Will. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org -- George Kontostanos aisecure.net http://www.aisecure.net ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Tue, Jun 21, 2011 at 11:58:17AM -0600, Will Andrews wrote: Hi Andrey, On Mon, Jun 20, 2011 at 5:46 AM, Andrey Chernov a...@freebsd.org wrote: As the second message in the thread states, I try first even 223296 with the same hang and the same xpt_action_default: CCB type 0xe not supported As I think, DDB's 'ps' indicates that kernel waits something from geom and geom waits something from ccb_scan forever, just raw guess. I will be glad to issue more specific DDB commands and upload corresponding photos. BTW, pluging and unplugging USB devides works in that stage. Please try this patch: http://people.freebsd.org/~will/patches/ata_xpt_add_advinfo.0.diff The patch does not fix the hang I talk about, but diagnostic is removed. -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Tue, Jun 21, 2011 at 10:17:19AM -0600, Kenneth D. Merry wrote: ps alltrace show locks show msgbuf Hopefully that will give us something to start looking at... This would really work a lot better if there is any way to get a serial console on the machine. The above will produce a good bit of output, and would likely need a lot of pictures. Since we can't reproduce the problem here, some debugging help would be greatly appreciated. Sorry I have no serial console. Here are the photos. I remove very similar looking USB parts from 'ps' and 'alltrace', and very general parts from 'alltrace' always been there. I hope remaining info will be enough. USB hotplagging works at this stage, so no reason to look there. If it will be not enough, I'll upload whole series. 'show lock' outputs nothing, it means no locks just sleep somewhere forever. 'ps': http://img43.imageshack.us/img43/1424/21062011001j.jpg http://img835.imageshack.us/img835/6607/21062011002.jpg http://img841.imageshack.us/img841/5401/21062011003.jpg 'alltrace': http://img864.imageshack.us/img864/6757/21062011004ya.jpg http://img542.imageshack.us/img542/4857/21062011005.jpg http://img828.imageshack.us/img828/823/21062011006.jpg http://img5.imageshack.us/img5/910/21062011007.jpg http://img7.imageshack.us/img7/4704/21062011008.jpg http://img848.imageshack.us/img848/5487/21062011009.jpg http://img641.imageshack.us/img641/2/21062011010.jpg http://img7.imageshack.us/img7/7946/21062011011.jpg http://img860.imageshack.us/img860/8185/21062011012.jpg http://img696.imageshack.us/img696/5276/21062011013.jpg http://img825.imageshack.us/img825/1249/21062011014m.jpg http://img839.imageshack.us/img839/3791/21062011015.jpg http://img594.imageshack.us/img594/1773/21062011016.jpg http://img109.imageshack.us/img109/9937/21062011017.jpg http://img51.imageshack.us/img51/6047/21062011018l.jpg 'show msgbuf': http://img59.imageshack.us/img59/46/21062011019.jpg http://img189.imageshack.us/img189/483/21062011020.jpg http://img19.imageshack.us/img19/8163/21062011021.jpg http://img683.imageshack.us/img683/3171/21062011022.jpg http://img819.imageshack.us/img819/5923/21062011023.jpg http://img692.imageshack.us/img692/3789/21062011024.jpg http://img580.imageshack.us/img580/1550/21062011025.jpg http://img560.imageshack.us/img560/7478/21062011026.jpg http://img94.imageshack.us/img94/9371/21062011027.jpg http://img857.imageshack.us/img857/5185/21062011028.jpg -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Sun, Jun 19, 2011 at 08:15:43PM -0600, Justin T. Gibbs wrote: On 6/19/11 6:19 PM, Andrey Chernov wrote: Exactly that commit is responsible for boot hang. Please fix. BTW, I have MBR on SATA disk (CAM emulated), ICH9. Since it works for me, you'll need to provide more information. Can you at least drop into kdb to determine the likely source of the hang by getting a stack trace of all processes to see where they are sleeping and dumping lock information? I drop into DDB and put 'bt' console photo in the very first message of this thread - nothing unusual seen in the main stack. Could you please specify exact DDB commands you want to be issued by me? No dump can be provided since nothing is mounted yet including swap, BTW, I remember I saw previously unseen warnings with post Jun 14 kernels: xpt_action_default: CCB type 0xe not supported 'ps' inside DDB shows [xpt_thrd] at ccb_scan wmesg state and [g_event] at caplck wmesg state, [kernel] at g_waitid state. Even don't know, if it matters. -- http://ache.vniz.net/ ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Mon, Jun 20, 2011 at 11:02:22AM +0400, Andrey Chernov wrote: On Sun, Jun 19, 2011 at 08:15:43PM -0600, Justin T. Gibbs wrote: On 6/19/11 6:19 PM, Andrey Chernov wrote: Exactly that commit is responsible for boot hang. Please fix. BTW, I have MBR on SATA disk (CAM emulated), ICH9. Since it works for me, you'll need to provide more information. Can you at least drop into kdb to determine the likely source of the hang by getting a stack trace of all processes to see where they are sleeping and dumping lock information? I drop into DDB and put 'bt' console photo in the very first message of this thread - nothing unusual seen in the main stack. Could you please specify exact DDB commands you want to be issued by me? No dump can be provided since nothing is mounted yet including swap, BTW, I remember I saw previously unseen warnings with post Jun 14 kernels: xpt_action_default: CCB type 0xe not supported 'ps' inside DDB shows [xpt_thrd] at ccb_scan wmesg state and [g_event] at caplck wmesg state, [kernel] at g_waitid state. Even don't know, if it matters. Just in case, please try r223277. pgp7CrzQfsgfV.pgp Description: PGP signature
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On Mon, Jun 20, 2011 at 11:01:46AM +0300, Kostik Belousov wrote: On Mon, Jun 20, 2011 at 11:02:22AM +0400, Andrey Chernov wrote: On Sun, Jun 19, 2011 at 08:15:43PM -0600, Justin T. Gibbs wrote: On 6/19/11 6:19 PM, Andrey Chernov wrote: Exactly that commit is responsible for boot hang. Please fix. BTW, I have MBR on SATA disk (CAM emulated), ICH9. Since it works for me, you'll need to provide more information. Can you at least drop into kdb to determine the likely source of the hang by getting a stack trace of all processes to see where they are sleeping and dumping lock information? I drop into DDB and put 'bt' console photo in the very first message of this thread - nothing unusual seen in the main stack. Could you please specify exact DDB commands you want to be issued by me? No dump can be provided since nothing is mounted yet including swap, BTW, I remember I saw previously unseen warnings with post Jun 14 kernels: xpt_action_default: CCB type 0xe not supported 'ps' inside DDB shows [xpt_thrd] at ccb_scan wmesg state and [g_event] at caplck wmesg state, [kernel] at g_waitid state. Even don't know, if it matters. Just in case, please try r223277. As the second message in the thread states, I try first even 223296 with the same hang and the same xpt_action_default: CCB type 0xe not supported As I think, DDB's 'ps' indicates that kernel waits something from geom and geom waits something from ccb_scan forever, just raw guess. I will be glad to issue more specific DDB commands and upload corresponding photos. BTW, pluging and unplugging USB devides works in that stage. -- http://ache.vniz.net/ pgpUHquEcbUw3.pgp Description: PGP signature
Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
On 6/19/11 6:19 PM, Andrey Chernov wrote: Exactly that commit is responsible for boot hang. Please fix. BTW, I have MBR on SATA disk (CAM emulated), ICH9. Since it works for me, you'll need to provide more information. Can you at least drop into kdb to determine the likely source of the hang by getting a stack trace of all processes to see where they are sleeping and dumping lock information? Thanks, Justin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org