Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-25 Thread Justin T. Gibbs

On 6/25/11 12:39 AM, Andrey Chernov wrote:

 On Fri, Jun 24, 2011 at 09:09:08PM -0400, Justin T. Gibbs wrote:
 No problem. I just set kern.geom.debugflags=4 in loader.conf and here is
 new photo (with recent kernel, no patches):
 http://img803.imageshack.us/img803/4679/25062011006.jpg
 I skip all noisy parts related to ada0 and ada1 partitions probes.
 As you can see, only 3 cd0-related geom call issued, right before cd1
 probe shown. Strange thing is that I see no single cd1-related geom
 call, but it may be because of hang.

 The GEOM processing is serialized, so that is not unexpected. What your
 logs are telling me is that the probe for CD0 is hanging. I don't know
 why.

 Could you just postpone GEOM calls after any probe will be completed? It
 seems GEOM goes here even before probe and waits for probe forever. What
 probe waits in the same time is unclear for me (ccb_scan), but CD devices
 are slow and may not survive such multisleeping, missing some 

responses in

 the middle.


The problem is not GEOM.  It's not the thread waiting in ccb_scan - that
thread is designed to wait there until an asynchronous device
arrival/departure event occurs which is not the case here.  The problem 
is in

or below CAM, and that problem is causing the probe to never complete.


 Are you positive it is this specific SVN revision that prevents cd0
 from probing properly and not one of my previous CAM commits?



 I use splitting by half method to find exact date which boots, then see
 the next commit above that date. Pre-commit kernel goes to multiuser and
 network is alive. I don't test CDs are working, I'll do that later and
 report it.


So you know that revisions 223081, 223084, 223085, and 223089 all boot
just fine?  I committed five revisions on that date.  223099 just happens
to be the last one for that day.

--
Justin

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-25 Thread Andrey Chernov
On Sat, Jun 25, 2011 at 07:27:20AM -0400, Justin T. Gibbs wrote:
   I use splitting by half method to find exact date which boots, then see
   the next commit above that date. Pre-commit kernel goes to multiuser and
   network is alive. I don't test CDs are working, I'll do that later and
   report it.
 
 So you know that revisions 223081, 223084, 223085, and 223089 all boot
 just fine?  I committed five revisions on that date.  223099 just happens
 to be the last one for that day.

OMG, I am very sorry to misleading you to wrong direction, and your 
time waste, give me pointy hat, please! It was r223081 which not boots.
I.e. kernel cvsupped at
src-sys date=2011.06.14.12.00.00
boots normally but cvsupped at
src-sys date=2011.06.14.14.54.00
i.e. in the same minute after r223081 committed 
Tue Jun 14 14:53:17 2011 UTC
hangs.
I don't know how I may miss commits you mention((

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-24 Thread Andrey Chernov
On Thu, Jun 23, 2011 at 11:01:17PM -0400, Justin T. Gibbs wrote:
 To test this theory, apply the following patch.  I do not know if this
 is safe for changer devices, but I will review the changer code if this
 patch fixes ache's problem.

I don't have changers. One of the plain ATA DVDs is read-only (cd0, which 
is not detected now) and another one is read-write (cd1, detected). I try 
the patch but nothing is changed in the picture. cd0 still not detected, 
xpt_thrd sleep in ccb_scan and g_event sleep in caplck. Here is probe from 
the old kernel which works, just in case, nothing unusual with it: 
cd0 at ata2 bus 0 scbus2 target 0 lun 0 
cd0: ASUS DVD-E616A 1.08 Removable CD-ROM SCSI-0 device 
cd0: 100.000MB/s transfers (UDMA5, ATAPI 12bytes, PIO 65534bytes) 
cd0: Attempt to query device size failed: NOT READY, Medium not present - tray 
closed Unplugging power from it allows to boot 
normally. 
Inserting media does not help too.

 --- //depot/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c2011-05-07 
 10:06:43.0 -0600
 +++ /home/justing/perforce/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c 
 2011-05-07 10:06:43.0 -0600
 @@ -687,6 +687,10 @@
   else
   softc-minimum_command_size = 6;
  
 + /*
 +  * Refcount and block open attempts until we are setup
 +  * Can't block
 +  */
   (void)cam_periph_hold(periph, PRIBIO);
   cam_periph_unlock(periph);
   /*
 @@ -747,7 +751,6 @@
   softc-disk-d_hba_subdevice = cpi.hba_subdevice;
   disk_create(softc-disk, DISK_VERSION);
   cam_periph_lock(periph);
 - cam_periph_unhold(periph);
  
   /*
* Add an async callback so that we get
 @@ -972,12 +975,6 @@
  
  cdregisterexit:
  
 - /*
 -  * Refcount and block open attempts until we are setup
 -  * Can't block
 -  */
 - (void)cam_periph_hold(periph, PRIBIO);
 -
   if ((softc-flags  CD_FLAG_CHANGER) == 0)
   xpt_schedule(periph, CAM_PRIORITY_DEV);
   else

 ___
 freebsd-current@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-current
 To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org



-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-24 Thread Andrey Chernov
I forget to mention that the place is the same as trace shows: 
g_event_procbody-g_run_events-g_new_provider_event-g_dev_taste-
g_dev_attrchanged-g_access-g_disk_access-cdopen-cam_periph_hold 
and it sleeps.

On Fri, Jun 24, 2011 at 05:08:27PM +0400, Andrey Chernov wrote:
 On Thu, Jun 23, 2011 at 11:01:17PM -0400, Justin T. Gibbs wrote:
  To test this theory, apply the following patch.  I do not know if this
  is safe for changer devices, but I will review the changer code if this
  patch fixes ache's problem.
 
 I don't have changers. One of the plain ATA DVDs is read-only (cd0, which 
 is not detected now) and another one is read-write (cd1, detected). I try 
 the patch but nothing is changed in the picture. cd0 still not detected, 
 xpt_thrd sleep in ccb_scan and g_event sleep in caplck. Here is probe from 
 the old kernel which works, just in case, nothing unusual with it: 
 cd0 at ata2 bus 0 scbus2 target 0 lun 0 
 cd0: ASUS DVD-E616A 1.08 Removable CD-ROM SCSI-0 device 
 cd0: 100.000MB/s transfers (UDMA5, ATAPI 12bytes, PIO 65534bytes) 
 cd0: Attempt to query device size failed: NOT READY, Medium not present - 
 tray closed Unplugging power from it allows to boot 
 normally. 
 Inserting media does not help too.
 
  --- //depot/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c  2011-05-07 
  10:06:43.0 -0600
  +++ /home/justing/perforce/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c   
  2011-05-07 10:06:43.0 -0600
  @@ -687,6 +687,10 @@
  else
  softc-minimum_command_size = 6;
   
  +   /*
  +* Refcount and block open attempts until we are setup
  +* Can't block
  +*/
  (void)cam_periph_hold(periph, PRIBIO);
  cam_periph_unlock(periph);
  /*
  @@ -747,7 +751,6 @@
  softc-disk-d_hba_subdevice = cpi.hba_subdevice;
  disk_create(softc-disk, DISK_VERSION);
  cam_periph_lock(periph);
  -   cam_periph_unhold(periph);
   
  /*
   * Add an async callback so that we get
  @@ -972,12 +975,6 @@
   
   cdregisterexit:
   
  -   /*
  -* Refcount and block open attempts until we are setup
  -* Can't block
  -*/
  -   (void)cam_periph_hold(periph, PRIBIO);
  -
  if ((softc-flags  CD_FLAG_CHANGER) == 0)
  xpt_schedule(periph, CAM_PRIORITY_DEV);
  else
 
  ___
  freebsd-current@freebsd.org mailing list
  http://lists.freebsd.org/mailman/listinfo/freebsd-current
  To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
 
 
 
 -- 
 http://ache.vniz.net/
 ___
 freebsd-current@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-current
 To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-24 Thread Scott Long

On Jun 23, 2011, at 11:18 PM, Scott Long wrote:

 
 On Jun 23, 2011, at 9:01 PM, Justin T. Gibbs wrote:
 
 On 6/22/11 4:09 PM, Kenneth D. Merry wrote:
 On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote:
 On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
 These two are interesting:
 
 http://img825.imageshack.us/img825/1249/21062011014m.jpg
 http://img839.imageshack.us/img839/3791/21062011015.jpg
 
 It looks like the GEOM event thread is stuck inside the cd(4) 
 driver. The
 cd(4) driver is trying to acquire the peripheral lock, and is sleeping
 until it gets it.
 
 What isn't clear is who is holding it.
 
 ...
 
 The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So
 that tells me that one of several things is going on:
 
 - There is a path in the cd(4) driver where it can call cam_periph_hold()
 but not cam_periph_unhold().
 
 - There is another thread in the system that has called cam_periph_hold(),
 and has gotten stuck before it can call cam_periph_unhold().
 
 - The hold/unhold logic is broken, and there is a case where a thread
 waiting for the lock can miss the wakeup. After looking at the code, I
 don't think this is the case, but I may have missed something.
 
 So it is probably one of the first two cases.
 
 ...
 
 I have a theory for the cause of this hang.
 
 The commit that triggers this problem added calls to g_access() during the
 geom_dev probe.  I believe this hit a race in cdregister() where
 the periph hold lock is dropped around the changer probe code.  Why the
 periph hold lock is dropped there, I do not know as I haven't fully
 reviewed the changer probe code.
 
 
 Are you talking about this?
 
disk_create(softc-disk, DISK_VERSION);
cam_periph_lock(periph);
cam_periph_unhold(periph);
 [...]
if (((cgd-ccb_h.target_lun  0)
   ((softc-quirks  CD_Q_NO_CHANGER) == 0))
 || ((softc-quirks  CD_Q_CHANGER) != 0)) {
 
 The unhold there compliments the hold that was done prior to disk_create().  
 The hold/unhold is done as a hack around the need to drop the periph/sim 
 mutex while calling disk_create(), due to the later's insistence on using 
 blocking calls.  I've wanted to re-think how that pattern is done (it's the 
 same gross hack in nearly all of the periph drivers), but haven't gotten 
 around to it.  If the 'hold' semaphore needs to be held longer to prevent the 
 race that you're theorizing, then it should be possible to simply extend its 
 coverage in the code block, but I'm not sure if it'll result in an unintended 
 deadlock with the changer enumeration/matching code.  I _think_ that it'll be 
 ok, but the density of magic in the code is a bit overwhelming at this time 
 of night =-)
 

Actually, what's probably happening is that the sim/periph lock is being 
dropped but the hold semaphore is held, disk_create() is called, which kicks 
off GEOM to do GEOM-ish things including the new g_access() call.  It tries to 
call cdopen(), which grabs the periph/sim mutex in order to then get the hold 
semaphore, and winds up sleeping because the semaphore is already held in 
cdregister().  If/when disk_create() returns, cdregister() winds up waiting for 
the periph lock because it's held by cdopen(), and now you have a deadlock 
between the two.

Again, this is my fault for being lazy and not re-organzing the periph drivers 
so that disk_create() is safely called without shady locking hacks.  I don't 
think that extending the coverage of the hold semaphore is going to help in 
this case.  The periphs need to adopt a new pattern where disk_create() isn't 
called until the periph is completely initialized and ready for periph_open() 
to be called without any locking gymnastics.

Scott


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-24 Thread Justin T. Gibbs

On 6/24/11 3:35 PM, Scott Long wrote:


 On Jun 23, 2011, at 11:18 PM, Scott Long wrote:


 On Jun 23, 2011, at 9:01 PM, Justin T. Gibbs wrote:

 On 6/22/11 4:09 PM, Kenneth D. Merry wrote:
 On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote:
 On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
 These two are interesting:

 http://img825.imageshack.us/img825/1249/21062011014m.jpg
 http://img839.imageshack.us/img839/3791/21062011015.jpg

 It looks like the GEOM event thread is stuck inside the cd(4)
 driver. The
 cd(4) driver is trying to acquire the peripheral lock, and is 

sleeping

 until it gets it.

 What isn't clear is who is holding it.

 ...

 The GEOM event thread is stuck sleeping in the mtx_sleep() call 

above. So

 that tells me that one of several things is going on:

 - There is a path in the cd(4) driver where it can call 

cam_periph_hold()

 but not cam_periph_unhold().

 - There is another thread in the system that has called 

cam_periph_hold(),

 and has gotten stuck before it can call cam_periph_unhold().

 - The hold/unhold logic is broken, and there is a case where a thread
 waiting for the lock can miss the wakeup. After looking at the code, I
 don't think this is the case, but I may have missed something.

 So it is probably one of the first two cases.

 ...

 I have a theory for the cause of this hang.

 The commit that triggers this problem added calls to g_access() 

during the

 geom_dev probe. I believe this hit a race in cdregister() where
 the periph hold lock is dropped around the changer probe code. Why the
 periph hold lock is dropped there, I do not know as I haven't fully
 reviewed the changer probe code.


 Are you talking about this?

 disk_create(softc-disk, DISK_VERSION);
 cam_periph_lock(periph);
 cam_periph_unhold(periph);
 [...]
 if (((cgd-ccb_h.target_lun  0)
  ((softc-quirks  CD_Q_NO_CHANGER) == 0))
 || ((softc-quirks  CD_Q_CHANGER) != 0)) {

 The unhold there compliments the hold that was done prior to 

disk_create().

 The hold/unhold is done as a hack around the need to drop the periph/sim
 mutex while calling disk_create(), due to the later's insistence on using
 blocking calls. I've wanted to re-think how that pattern is done (it's
 the same gross hack in nearly all of the periph drivers), but haven't
 gotten around to it. If the 'hold' semaphore needs to be held longer to
 prevent the race that you're theorizing, then it should be possible to
 simply extend its coverage in the code block, but I'm not sure if it'll
 result in an unintended deadlock with the changer enumeration/matching
 code. I _think_ that it'll be ok, but the density of magic in the code
 is a bit overwhelming at this time of night =-)


The cd driver is the only one that temporarily drops the periph hold
semaphore before scheduling and completing the probe processing.  I think
the above race could happen but that there is something else causing the
current hang ache is experiencing.


 Actually, what's probably happening is that the sim/periph lock is being
 dropped but the hold semaphore is held, disk_create() is called, which
 kicks off GEOM to do GEOM-ish things including the new g_access() call.
 It tries to call cdopen(), which grabs the periph/sim mutex in order to
 then get the hold semaphore, and winds up sleeping because the semaphore
 is already held in cdregister(). If/when disk_create() returns,
 cdregister() winds up waiting for the periph lock because it's held by
 cdopen(), and now you have a deadlock between the two.


The sim mutex is dropped while waiting for the periph hold semaphore, so
I don't understand the deadlock you are describing.

If there is still a traditional deadlock here, I'd expect to find a thread
somewhere blocking on a mutex required to release the periph lock semaphore.
Ache hasn't reported a substantive change to the ddb output, just that
instead of g_eli blocked in open, it is now g_dev.

Instead, I believe that either one of the GEOM taste methods is leaking an
access reference (so cdclose() is not called), or the CD driver is failing
to release the hold semaphore during probing.  Setting kern.geom.debugflags
to '4' will trace the access calls and allow the GEOM side to be ruled out.
If GEOM is exonerated, we can add tracing to cam_perihp_(un)hold to track
this down further.

--
Justin

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-24 Thread Andrey Chernov
On Fri, Jun 24, 2011 at 04:20:24PM -0400, Justin T. Gibbs wrote:
 Instead, I believe that either one of the GEOM taste methods is leaking an
 access reference (so cdclose() is not called), or the CD driver is failing
 to release the hold semaphore during probing.  Setting kern.geom.debugflags
 to '4' will trace the access calls and allow the GEOM side to be ruled out.
 If GEOM is exonerated, we can add tracing to cam_perihp_(un)hold to track
 this down further.

No problem. I just set kern.geom.debugflags=4 in loader.conf and here is 
new photo (with recent kernel, no patches):
http://img803.imageshack.us/img803/4679/25062011006.jpg
I skip all noisy parts related to ada0 and ada1 partitions probes.
As you can see, only 3 cd0-related geom call issued, right before cd1 
probe shown. Strange thing is that I see no single cd1-related geom 
call, but it may be because of hang.

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-24 Thread Justin T. Gibbs

On 6/24/11 6:26 PM, Andrey Chernov wrote:

 On Fri, Jun 24, 2011 at 04:20:24PM -0400, Justin T. Gibbs wrote:
 Instead, I believe that either one of the GEOM taste methods is 

leaking an
 access reference (so cdclose() is not called), or the CD driver is 

failing
 to release the hold semaphore during probing. Setting 

kern.geom.debugflags
 to '4' will trace the access calls and allow the GEOM side to be 

ruled out.

 If GEOM is exonerated, we can add tracing to cam_perihp_(un)hold to track
 this down further.

 No problem. I just set kern.geom.debugflags=4 in loader.conf and here is
 new photo (with recent kernel, no patches):
 http://img803.imageshack.us/img803/4679/25062011006.jpg
 I skip all noisy parts related to ada0 and ada1 partitions probes.
 As you can see, only 3 cd0-related geom call issued, right before cd1
 probe shown. Strange thing is that I see no single cd1-related geom
 call, but it may be because of hang.


The GEOM processing is serialized, so that is not unexpected.  What your
logs are telling me is that the probe for CD0 is hanging.  I don't know
why.

Are you positive it is this specific SVN revision that prevents cd0
from probing properly and not one of my previous CAM commits?  Just
getting to multi-user doesn't mean we're ok here.  My GEOM changes may
make the system hang earlier, but you'll need to test access to cd0
even if you get to multi-user mode to be sure that the device is
functioning correctly.  I just want to be positive that we're barking
up the right tree.

--
Justin

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-24 Thread Andrey Chernov
On Fri, Jun 24, 2011 at 09:09:08PM -0400, Justin T. Gibbs wrote:
   No problem. I just set kern.geom.debugflags=4 in loader.conf and here is
   new photo (with recent kernel, no patches):
   http://img803.imageshack.us/img803/4679/25062011006.jpg
   I skip all noisy parts related to ada0 and ada1 partitions probes.
   As you can see, only 3 cd0-related geom call issued, right before cd1
   probe shown. Strange thing is that I see no single cd1-related geom
   call, but it may be because of hang.
 
 The GEOM processing is serialized, so that is not unexpected.  What your
 logs are telling me is that the probe for CD0 is hanging.  I don't know
 why.

Could you just postpone GEOM calls after any probe will be completed? It 
seems GEOM goes here even before probe and waits for probe forever. What 
probe waits in the same time is unclear for me (ccb_scan), but CD devices 
are slow and may not survive such multisleeping, missing some responses in 
the middle.

 Are you positive it is this specific SVN revision that prevents cd0
 from probing properly and not one of my previous CAM commits?  Just
 getting to multi-user doesn't mean we're ok here.  My GEOM changes may
 make the system hang earlier, but you'll need to test access to cd0
 even if you get to multi-user mode to be sure that the device is
 functioning correctly.  I just want to be positive that we're barking
 up the right tree.

I use splitting by half method to find exact date which boots, then see 
the next commit above that date. Pre-commit kernel goes to multiuser and 
network is alive. I don't test CDs are working, I'll do that later and 
report it.

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-24 Thread Andrey Chernov
On Sat, Jun 25, 2011 at 08:39:17AM +0400, Andrey Chernov wrote:
  Are you positive it is this specific SVN revision that prevents cd0
  from probing properly and not one of my previous CAM commits?  Just
  getting to multi-user doesn't mean we're ok here.  My GEOM changes may
  make the system hang earlier, but you'll need to test access to cd0
  even if you get to multi-user mode to be sure that the device is
  functioning correctly.  I just want to be positive that we're barking
  up the right tree.
 
 I use splitting by half method to find exact date which boots, then see 
 the next commit above that date. Pre-commit kernel goes to multiuser and 
 network is alive. I don't test CDs are working, I'll do that later and 
 report it.

Just test it, kernel dated
src-sys date=2011.06.14.12.00.00
Both DVDs are mounted normally and files are readable on both.
Here is g_* debugging and cd* related parts from dmesg of working kernel 
(with few lines before/after to hint the place). As you can see, cd0-probe 
appears far later in dmesg, but cd0-geom (not much later cd1-geom) is 
earlier thing:
...
ada0 at ahcich2 bus 0 scbus3 target 0 lun 0
ada0: SAMSUNG HD502HJ 1AJ10001 ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
g_access(0xc652fe40(cd0), 1, 0, 0)
open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4500(cd0)
g_disk_access(cd0, 1, 0, 0)
ada0: 476940MB (976773168 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad10
ada1 at ahcich3 bus 0 scbus4 target 0 lun 0
ada1: ST31500341AS CC1H ATA-8 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad12
...
Timecounter TSC-low frequency 14077952 Hz quality 1000
WARNING: WITNESS option enabled, expect reduced performance.
cd1 at ata2 bus 0 scbus2 target 1 lun 0
cd1: PLEXTOR DVDR   PX-740A 1.02 Removable CD-ROM SCSI-0 device 
cd1: 33.300MB/s transfers (UDMA2, ATAPI 12bytes, PIO 65534bytes)
cd1: Attempt to query device size failed: NOT READY, Medium not present
uhub0: 2 ports with 2 removable, self powered
...
uhub7: 6 ports with 6 removable, self powered
cd0 at ata2 bus 0 scbus2 target 0 lun 0
cd0: ASUS DVD-E616A 1.08 Removable CD-ROM SCSI-0 device 
cd0: 100.000MB/s transfers (UDMA5, ATAPI 12bytes, PIO 65534bytes)
cd0: Attempt to query device size failed: NOT READY, Medium not present - tray 
closed
g_access(0xc652fe40(cd0), -1, 0, 0)
open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4500(cd0)
g_disk_access(cd0, -1, 0, 0)
g_access(0xc652fe00(cd0), 1, 0, 0)
open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4500(cd0)
g_disk_access(cd0, 1, 0, 0)
g_access(0xc652fe00(cd0), -1, 0, 0)
open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4500(cd0)
g_disk_access(cd0, -1, 0, 0)
g_access(0xc652fdc0(cd0), 1, 0, 0)
open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4500(cd0)
g_disk_access(cd0, 1, 0, 0)
g_access(0xc652fdc0(cd0), -1, 0, 0)
open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4500(cd0)
g_disk_access(cd0, -1, 0, 0)
g_access(0xc652fd40(cd1), 1, 0, 0)
open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4280(cd1)
g_disk_access(cd1, 1, 0, 0)
g_access(0xc652fd40(cd1), -1, 0, 0)
open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4280(cd1)
g_disk_access(cd1, -1, 0, 0)
g_access(0xc652fd00(cd1), 1, 0, 0)
open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4280(cd1)
g_disk_access(cd1, 1, 0, 0)
g_access(0xc652fd00(cd1), -1, 0, 0)
open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4280(cd1)
g_disk_access(cd1, -1, 0, 0)
g_access(0xc652fcc0(cd1), 1, 0, 0)
open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xc65d4280(cd1)
g_disk_access(cd1, 1, 0, 0)
g_access(0xc652fcc0(cd1), -1, 0, 0)
open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xc65d4280(cd1)
g_disk_access(cd1, -1, 0, 0)
g_access(0xc6b1c040(ada0), 1, 0, 0)
...

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-23 Thread Andriy Gapon
on 22/06/2011 23:09 Kenneth D. Merry said the following:
 The GEOM event thread is stuck sleeping in the mtx_sleep() call above.  So
 that tells me that one of several things is going on:
 
  - There is a path in the cd(4) driver where it can call cam_periph_hold()
but not cam_periph_unhold().
 
  - There is another thread in the system that has called cam_periph_hold(),
and has gotten stuck before it can call cam_periph_unhold().
 
  - The hold/unhold logic is broken, and there is a case where a thread
waiting for the lock can miss the wakeup.  After looking at the code, I
don't think this is the case, but I may have missed something.
 
 So it is probably one of the first two cases.  From the dmesg, I only see
 cd1 listed, not cd0.  So it is possible that cd0 is stuck in the probe code
 somewhere, and the geom code just gets stuck trying to open it when the
 probe hasn't completed.
 
 Seeing the stack trace for the taskq thread that is running on CPU 0
 (process 100014) might be enlightening, it's hard to say.  That may or may
 not show the issue.
 
 It's possible that this issue is directly related to the commit in
 question; perhaps there is an error being returned that wasn't returned
 before and it isn't being handled right in the cd(4) driver.  (The cd(4)
 driver wasn't touched in the commit.)
 
 It's also possible that the commit in question just changed the timing and
 your system is hitting a race that was there previously.

I have a suspicion that this is actually the case.
More than once I've seen under qemu that the kernel boot non-deterministically
gets stuck in the cd driver.  Other people have also bumped into this.
E.g., here's one of the reports that I googled up, it's not exactly the same as
what ache has reported, but somewhat similar:
http://lists.freebsd.org/pipermail/freebsd-current/2010-October/020336.html

-- 
Andriy Gapon
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-23 Thread Andrey Chernov
Apparently there is another problem plain ATA CD/DVD related. With r223443
hangs nature is changed: I see no more waiting in caplck state, just 
xpt_thrd waiting in ccb_scan state forever and those repeated messages:
run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
run_interrupt_driven_hooks: still waiting after 120 seconds for xpt_config
...
and so on.

On Wed, Jun 22, 2011 at 02:09:19PM -0600, Kenneth D. Merry wrote:
 On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote:
  On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
   These two are interesting:
   
http://img825.imageshack.us/img825/1249/21062011014m.jpg
http://img839.imageshack.us/img839/3791/21062011015.jpg
   
   It looks like the GEOM event thread is stuck inside the cd(4) driver.  The
   cd(4) driver is trying to acquire the peripheral lock, and is sleeping
   until it gets it.
   
   What isn't clear is who is holding it.  The ps output shows an idle thread
   running on CPU 1, and thread 100014 (taskq) running on CPU 0.
   Unfortunately I don't see a stack trace for that.  (I might have missed
   it.)
   
   Do you happen to have the image with the stack trace for that thread?
  
  I don't have the image because no disks are mounted at that stage and the 
  swap slice is not attached. But I can issue more specific DDB commands to 
  narrow it down, just say what you need in detail.
  
  BTW, the machine have 2 DVD both are attached to Marvell IDE plain ATA 
  interface, they always works before.
  
  Are you sure that something holding the lock? 'show lock' shows absolutely 
  nothing, it is empty.
 
 Well, after looking at the code a little more, it looks like the lock
 that is being held is the periph lock, which is really just a flag.
 So 'show lock' wouldn't show anything relevant.  Here's cam_periph_hold():
 
 int
 cam_periph_hold(struct cam_periph *periph, int priority)
 {
   int error;
 
   /*
* Increment the reference count on the peripheral
* while we wait for our lock attempt to succeed
* to ensure the peripheral doesn't disappear out
* from user us while we sleep.
*/
 
   if (cam_periph_acquire(periph) != CAM_REQ_CMP)
   return (ENXIO);
 
   mtx_assert(periph-sim-mtx, MA_OWNED);
   while ((periph-flags  CAM_PERIPH_LOCKED) != 0) {
   periph-flags |= CAM_PERIPH_LOCK_WANTED;
   if ((error = mtx_sleep(periph, periph-sim-mtx, priority,
caplck, 0)) != 0) {
   cam_periph_release_locked(periph);
   return (error);
   }
   }
 
   periph-flags |= CAM_PERIPH_LOCKED;
   return (0);
 }
 
 The GEOM event thread is stuck sleeping in the mtx_sleep() call above.  So
 that tells me that one of several things is going on:
 
  - There is a path in the cd(4) driver where it can call cam_periph_hold()
but not cam_periph_unhold().
 
  - There is another thread in the system that has called cam_periph_hold(),
and has gotten stuck before it can call cam_periph_unhold().
 
  - The hold/unhold logic is broken, and there is a case where a thread
waiting for the lock can miss the wakeup.  After looking at the code, I
don't think this is the case, but I may have missed something.
 
 So it is probably one of the first two cases.  From the dmesg, I only see
 cd1 listed, not cd0.  So it is possible that cd0 is stuck in the probe code
 somewhere, and the geom code just gets stuck trying to open it when the
 probe hasn't completed.
 
 Seeing the stack trace for the taskq thread that is running on CPU 0
 (process 100014) might be enlightening, it's hard to say.  That may or may
 not show the issue.
 
 It's possible that this issue is directly related to the commit in
 question; perhaps there is an error being returned that wasn't returned
 before and it isn't being handled right in the cd(4) driver.  (The cd(4)
 driver wasn't touched in the commit.)
 
 It's also possible that the commit in question just changed the timing and
 your system is hitting a race that was there previously.
 
 Ken
 -- 
 Kenneth Merry
 k...@freebsd.org


-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-23 Thread Andrey Chernov
On Thu, Jun 23, 2011 at 03:51:36PM +0300, Andriy Gapon wrote:
 More than once I've seen under qemu that the kernel boot non-deterministically
 gets stuck in the cd driver.  Other people have also bumped into this.
 E.g., here's one of the reports that I googled up, it's not exactly the same 
 as
 what ache has reported, but somewhat similar:
 http://lists.freebsd.org/pipermail/freebsd-current/2010-October/020336.html

Thread named as xpt_thrd might be of particular interest.
See my reply to ken (after applying recent r223443), caplck 
seems gone, but xpt_thrd waiting forever in ccb_scan with repeated 
messages:
run_interrupt_driven_hooks: still waiting after XXX seconds for xpt_config

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-23 Thread Andrey Chernov
On Wed, Jun 22, 2011 at 02:09:19PM -0600, Kenneth D. Merry wrote:
 Well, after looking at the code a little more, it looks like the lock
 that is being held is the periph lock, which is really just a flag.
 So 'show lock' wouldn't show anything relevant.  Here's cam_periph_hold():

With recent r223475 situation returned to what you describe (and still no 
cd0 in probe). As tracing 18 show, it was cdopen() who calls 
cam_periph_hold() and sleeps there forever.
Tracing 100014 as you suggest shows almost nothing: only fork_trampoline() 
is there.

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-23 Thread Justin T. Gibbs

On 6/22/11 4:09 PM, Kenneth D. Merry wrote:

 On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote:
 On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
 These two are interesting:

 http://img825.imageshack.us/img825/1249/21062011014m.jpg
 http://img839.imageshack.us/img839/3791/21062011015.jpg

 It looks like the GEOM event thread is stuck inside the cd(4) 

driver. The

 cd(4) driver is trying to acquire the peripheral lock, and is sleeping
 until it gets it.

 What isn't clear is who is holding it.


...


 The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So
 that tells me that one of several things is going on:

 - There is a path in the cd(4) driver where it can call cam_periph_hold()
 but not cam_periph_unhold().

 - There is another thread in the system that has called cam_periph_hold(),
 and has gotten stuck before it can call cam_periph_unhold().

 - The hold/unhold logic is broken, and there is a case where a thread
 waiting for the lock can miss the wakeup. After looking at the code, I
 don't think this is the case, but I may have missed something.

 So it is probably one of the first two cases.


...

I have a theory for the cause of this hang.

The commit that triggers this problem added calls to g_access() during the
geom_dev probe.  I believe this hit a race in cdregister() where
the periph hold lock is dropped around the changer probe code.  Why the
periph hold lock is dropped there, I do not know as I haven't fully
reviewed the changer probe code.

The drop of the lock in cdregister() can allow geom classes to probe and
thus call g_access()-g_disk_access()-cdopen() before a probe is initiated
in the normal way by cdregister().  cdopen() checks for media presence
by issuing immediate ccds.  When the race is exploited, the peripheral will
be in the probe state when the immediate ccbs are requested.  This will
cause the device probe to be performed before the immediate ccd is returned.
When the cdopen() activity finally unwinds, cdregister() will again
take the periph hold lock and schedule the peripheral, expecting probe
processing to complete and release the hold lock.  However, since the
periph is already in the normal state (due to the successful probe performed
indirectly by the cdopen() call), that unlock never happens, thus wedging
the device.

To test this theory, apply the following patch.  I do not know if this
is safe for changer devices, but I will review the changer code if this
patch fixes ache's problem.

--
Justin

--- //depot/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c  2011-05-07 
10:06:43.0 -0600
+++ /home/justing/perforce/vendor/FreeBSD/head/sys/cam/scsi/scsi_cd.c   
2011-05-07 10:06:43.0 -0600
@@ -687,6 +687,10 @@
else
softc-minimum_command_size = 6;
 
+   /*
+* Refcount and block open attempts until we are setup
+* Can't block
+*/
(void)cam_periph_hold(periph, PRIBIO);
cam_periph_unlock(periph);
/*
@@ -747,7 +751,6 @@
softc-disk-d_hba_subdevice = cpi.hba_subdevice;
disk_create(softc-disk, DISK_VERSION);
cam_periph_lock(periph);
-   cam_periph_unhold(periph);
 
/*
 * Add an async callback so that we get
@@ -972,12 +975,6 @@
 
 cdregisterexit:
 
-   /*
-* Refcount and block open attempts until we are setup
-* Can't block
-*/
-   (void)cam_periph_hold(periph, PRIBIO);
-
if ((softc-flags  CD_FLAG_CHANGER) == 0)
xpt_schedule(periph, CAM_PRIORITY_DEV);
else
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-23 Thread Scott Long

On Jun 23, 2011, at 9:01 PM, Justin T. Gibbs wrote:

 On 6/22/11 4:09 PM, Kenneth D. Merry wrote:
 On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote:
  On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
  These two are interesting:
 
  http://img825.imageshack.us/img825/1249/21062011014m.jpg
  http://img839.imageshack.us/img839/3791/21062011015.jpg
 
  It looks like the GEOM event thread is stuck inside the cd(4) 
 driver. The
  cd(4) driver is trying to acquire the peripheral lock, and is sleeping
  until it gets it.
 
  What isn't clear is who is holding it.
 
 ...
 
 The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So
 that tells me that one of several things is going on:
 
 - There is a path in the cd(4) driver where it can call cam_periph_hold()
 but not cam_periph_unhold().
 
 - There is another thread in the system that has called cam_periph_hold(),
 and has gotten stuck before it can call cam_periph_unhold().
 
 - The hold/unhold logic is broken, and there is a case where a thread
 waiting for the lock can miss the wakeup. After looking at the code, I
 don't think this is the case, but I may have missed something.
 
 So it is probably one of the first two cases.
 
 ...
 
 I have a theory for the cause of this hang.
 
 The commit that triggers this problem added calls to g_access() during the
 geom_dev probe.  I believe this hit a race in cdregister() where
 the periph hold lock is dropped around the changer probe code.  Why the
 periph hold lock is dropped there, I do not know as I haven't fully
 reviewed the changer probe code.
 

Are you talking about this?

disk_create(softc-disk, DISK_VERSION);
cam_periph_lock(periph);
cam_periph_unhold(periph);
[...]
if (((cgd-ccb_h.target_lun  0)
   ((softc-quirks  CD_Q_NO_CHANGER) == 0))
 || ((softc-quirks  CD_Q_CHANGER) != 0)) {

The unhold there compliments the hold that was done prior to disk_create().  
The hold/unhold is done as a hack around the need to drop the periph/sim mutex 
while calling disk_create(), due to the later's insistence on using blocking 
calls.  I've wanted to re-think how that pattern is done (it's the same gross 
hack in nearly all of the periph drivers), but haven't gotten around to it.  If 
the 'hold' semaphore needs to be held longer to prevent the race that you're 
theorizing, then it should be possible to simply extend its coverage in the 
code block, but I'm not sure if it'll result in an unintended deadlock with the 
changer enumeration/matching code.  I _think_ that it'll be ok, but the density 
of magic in the code is a bit overwhelming at this time of night =-)

Scott



___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-22 Thread Garrett Cooper
On Tue, Jun 21, 2011 at 9:51 PM, Andrey Chernov a...@freebsd.org wrote:
 Remove cd/acd from your kernel config to see if that allows
 you to boot?

 I unplug DVDs physically and kernel finally boots!
 BTW both DVDs was empty during the hanged boot and works normally under
 Win7.

Put a DVD in each of the drives, like so:

Case 1:
- Drive one populated.
- Drive two empty.

Case 2:
- Drive one empty.
- Drive two populated.

Case 3:
- Drive one populated.
- Drive two populated.

See what happens. I've run into issues with FreeBSD in the past with
some motherboards / drives / BIOSes where things were hung up unless I
put a DVD in my drive (or vice versa) because of cd(4)/cam(4).
Thanks,
-Garrett
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-22 Thread Andrey Chernov
On Tue, Jun 21, 2011 at 11:19:38PM -0700, Garrett Cooper wrote:
 On Tue, Jun 21, 2011 at 9:51 PM, Andrey Chernov a...@freebsd.org wrote:
  Remove cd/acd from your kernel config to see if that allows
  you to boot?
 
  I unplug DVDs physically and kernel finally boots!
  BTW both DVDs was empty during the hanged boot and works normally under
  Win7.
 
 Put a DVD in each of the drives, like so:
 
 Case 1:
 - Drive one populated.
 - Drive two empty.
 
 Case 2:
 - Drive one empty.
 - Drive two populated.
 
 Case 3:
 - Drive one populated.
 - Drive two populated.
 
 See what happens. I've run into issues with FreeBSD in the past with
 some motherboards / drives / BIOSes where things were hung up unless I
 put a DVD in my drive (or vice versa) because of cd(4)/cam(4).

I test both DVD inside case and it hangs.

Moreover, with both DVD inside DVD devices are even not detected at all as 
previously (the machine hangs even earlier).
And I have messages I don't see before:
run_interrupt_driven_hooks: still waiting after 60 seconds for 
xpt_config
run_interrupt_driven_hooks: still waiting after 120 seconds for 
xpt_config
...
and so on.

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-22 Thread Andrey Chernov
I forget to add that only 1 disk inside each drive doesn't change this 
picture at all.

On Wed, Jun 22, 2011 at 10:42:30AM +0400, Andrey Chernov wrote:
 On Tue, Jun 21, 2011 at 11:19:38PM -0700, Garrett Cooper wrote:
  On Tue, Jun 21, 2011 at 9:51 PM, Andrey Chernov a...@freebsd.org wrote:
   Remove cd/acd from your kernel config to see if that allows
   you to boot?
  
   I unplug DVDs physically and kernel finally boots!
   BTW both DVDs was empty during the hanged boot and works normally under
   Win7.
  
  Put a DVD in each of the drives, like so:
  
  Case 1:
  - Drive one populated.
  - Drive two empty.
  
  Case 2:
  - Drive one empty.
  - Drive two populated.
  
  Case 3:
  - Drive one populated.
  - Drive two populated.
  
  See what happens. I've run into issues with FreeBSD in the past with
  some motherboards / drives / BIOSes where things were hung up unless I
  put a DVD in my drive (or vice versa) because of cd(4)/cam(4).
 
 I test both DVD inside case and it hangs.
 
 Moreover, with both DVD inside DVD devices are even not detected at all as 
 previously (the machine hangs even earlier).
 And I have messages I don't see before:
 run_interrupt_driven_hooks: still waiting after 60 seconds for 
 xpt_config
 run_interrupt_driven_hooks: still waiting after 120 seconds for 
 xpt_config
 ...
 and so on.
 
 -- 
 http://ache.vniz.net/
 ___
 freebsd-current@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-current
 To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-22 Thread Kenneth D. Merry
On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote:
 On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
  These two are interesting:
  
   http://img825.imageshack.us/img825/1249/21062011014m.jpg
   http://img839.imageshack.us/img839/3791/21062011015.jpg
  
  It looks like the GEOM event thread is stuck inside the cd(4) driver.  The
  cd(4) driver is trying to acquire the peripheral lock, and is sleeping
  until it gets it.
  
  What isn't clear is who is holding it.  The ps output shows an idle thread
  running on CPU 1, and thread 100014 (taskq) running on CPU 0.
  Unfortunately I don't see a stack trace for that.  (I might have missed
  it.)
  
  Do you happen to have the image with the stack trace for that thread?
 
 I don't have the image because no disks are mounted at that stage and the 
 swap slice is not attached. But I can issue more specific DDB commands to 
 narrow it down, just say what you need in detail.
 
 BTW, the machine have 2 DVD both are attached to Marvell IDE plain ATA 
 interface, they always works before.
 
 Are you sure that something holding the lock? 'show lock' shows absolutely 
 nothing, it is empty.

Well, after looking at the code a little more, it looks like the lock
that is being held is the periph lock, which is really just a flag.
So 'show lock' wouldn't show anything relevant.  Here's cam_periph_hold():

int
cam_periph_hold(struct cam_periph *periph, int priority)
{
int error;

/*
 * Increment the reference count on the peripheral
 * while we wait for our lock attempt to succeed
 * to ensure the peripheral doesn't disappear out
 * from user us while we sleep.
 */

if (cam_periph_acquire(periph) != CAM_REQ_CMP)
return (ENXIO);

mtx_assert(periph-sim-mtx, MA_OWNED);
while ((periph-flags  CAM_PERIPH_LOCKED) != 0) {
periph-flags |= CAM_PERIPH_LOCK_WANTED;
if ((error = mtx_sleep(periph, periph-sim-mtx, priority,
 caplck, 0)) != 0) {
cam_periph_release_locked(periph);
return (error);
}
}

periph-flags |= CAM_PERIPH_LOCKED;
return (0);
}

The GEOM event thread is stuck sleeping in the mtx_sleep() call above.  So
that tells me that one of several things is going on:

 - There is a path in the cd(4) driver where it can call cam_periph_hold()
   but not cam_periph_unhold().

 - There is another thread in the system that has called cam_periph_hold(),
   and has gotten stuck before it can call cam_periph_unhold().

 - The hold/unhold logic is broken, and there is a case where a thread
   waiting for the lock can miss the wakeup.  After looking at the code, I
   don't think this is the case, but I may have missed something.

So it is probably one of the first two cases.  From the dmesg, I only see
cd1 listed, not cd0.  So it is possible that cd0 is stuck in the probe code
somewhere, and the geom code just gets stuck trying to open it when the
probe hasn't completed.

Seeing the stack trace for the taskq thread that is running on CPU 0
(process 100014) might be enlightening, it's hard to say.  That may or may
not show the issue.

It's possible that this issue is directly related to the commit in
question; perhaps there is an error being returned that wasn't returned
before and it isn't being handled right in the cd(4) driver.  (The cd(4)
driver wasn't touched in the commit.)

It's also possible that the commit in question just changed the timing and
your system is hitting a race that was there previously.

Ken
-- 
Kenneth Merry
k...@freebsd.org
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-21 Thread Kenneth D. Merry
On Wed, Jun 22, 2011 at 00:49:34 +0400, Andrey Chernov wrote:
 On Tue, Jun 21, 2011 at 10:17:19AM -0600, Kenneth D. Merry wrote:
  ps
  alltrace
  show locks
  show msgbuf
  
  Hopefully that will give us something to start looking at...
  
  This would really work a lot better if there is any way to get a serial
  console on the machine.  The above will produce a good bit of output, and
  would likely need a lot of pictures.
  
  Since we can't reproduce the problem here, some debugging help would be
  greatly appreciated.
 
 Sorry I have no serial console. Here are the photos. I remove very similar 
 looking USB parts from 'ps' and 'alltrace', and very general parts from 
 'alltrace' always been there. I hope remaining info will be enough. USB 
 hotplagging works at this stage, so no reason to look there. If it will be 
 not enough, I'll upload whole series.

Thanks for uploading all of the photos.  That's a lot of work, but they are
helpful...

I think I see part of the problem, but not the whole problem:

 'show lock' outputs nothing, it means no locks just sleep somewhere 
 forever.
 
 'ps':
 http://img43.imageshack.us/img43/1424/21062011001j.jpg
 http://img835.imageshack.us/img835/6607/21062011002.jpg
 http://img841.imageshack.us/img841/5401/21062011003.jpg
 
 'alltrace':
 http://img864.imageshack.us/img864/6757/21062011004ya.jpg
 http://img542.imageshack.us/img542/4857/21062011005.jpg
 http://img828.imageshack.us/img828/823/21062011006.jpg
 http://img5.imageshack.us/img5/910/21062011007.jpg
 http://img7.imageshack.us/img7/4704/21062011008.jpg
 http://img848.imageshack.us/img848/5487/21062011009.jpg
 http://img641.imageshack.us/img641/2/21062011010.jpg
 http://img7.imageshack.us/img7/7946/21062011011.jpg
 http://img860.imageshack.us/img860/8185/21062011012.jpg
 http://img696.imageshack.us/img696/5276/21062011013.jpg

These two are interesting:

 http://img825.imageshack.us/img825/1249/21062011014m.jpg
 http://img839.imageshack.us/img839/3791/21062011015.jpg

It looks like the GEOM event thread is stuck inside the cd(4) driver.  The
cd(4) driver is trying to acquire the peripheral lock, and is sleeping
until it gets it.

What isn't clear is who is holding it.  The ps output shows an idle thread
running on CPU 1, and thread 100014 (taskq) running on CPU 0.
Unfortunately I don't see a stack trace for that.  (I might have missed
it.)

Do you happen to have the image with the stack trace for that thread?

 http://img594.imageshack.us/img594/1773/21062011016.jpg
 http://img109.imageshack.us/img109/9937/21062011017.jpg
 http://img51.imageshack.us/img51/6047/21062011018l.jpg
 
 'show msgbuf':
 http://img59.imageshack.us/img59/46/21062011019.jpg
 http://img189.imageshack.us/img189/483/21062011020.jpg
 http://img19.imageshack.us/img19/8163/21062011021.jpg
 http://img683.imageshack.us/img683/3171/21062011022.jpg
 http://img819.imageshack.us/img819/5923/21062011023.jpg
 http://img692.imageshack.us/img692/3789/21062011024.jpg
 http://img580.imageshack.us/img580/1550/21062011025.jpg
 http://img560.imageshack.us/img560/7478/21062011026.jpg
 http://img94.imageshack.us/img94/9371/21062011027.jpg
 http://img857.imageshack.us/img857/5185/21062011028.jpg

Thanks,

Ken
-- 
Kenneth Merry
k...@freebsd.org
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-21 Thread Andrey Chernov
On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
 These two are interesting:
 
  http://img825.imageshack.us/img825/1249/21062011014m.jpg
  http://img839.imageshack.us/img839/3791/21062011015.jpg
 
 It looks like the GEOM event thread is stuck inside the cd(4) driver.  The
 cd(4) driver is trying to acquire the peripheral lock, and is sleeping
 until it gets it.
 
 What isn't clear is who is holding it.  The ps output shows an idle thread
 running on CPU 1, and thread 100014 (taskq) running on CPU 0.
 Unfortunately I don't see a stack trace for that.  (I might have missed
 it.)
 
 Do you happen to have the image with the stack trace for that thread?

I don't have the image because no disks are mounted at that stage and the 
swap slice is not attached. But I can issue more specific DDB commands to 
narrow it down, just say what you need in detail.

BTW, the machine have 2 DVD both are attached to Marvell IDE plain ATA 
interface, they always works before.

Are you sure that something holding the lock? 'show lock' shows absolutely 
nothing, it is empty.

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-21 Thread Andrey Chernov
 Remove cd/acd from your kernel config to see if that allows
 you to boot?

I unplug DVDs physically and kernel finally boots!
BTW both DVDs was empty during the hanged boot and works normally under 
Win7.

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-21 Thread Kenneth D. Merry
On Mon, Jun 20, 2011 at 15:46:56 +0400, Andrey Chernov wrote:
 On Mon, Jun 20, 2011 at 11:01:46AM +0300, Kostik Belousov wrote:
  On Mon, Jun 20, 2011 at 11:02:22AM +0400, Andrey Chernov wrote:
   On Sun, Jun 19, 2011 at 08:15:43PM -0600, Justin T. Gibbs wrote:
On 6/19/11 6:19 PM, Andrey Chernov wrote:
 Exactly that commit is responsible for boot hang.
 Please fix.
 BTW, I have MBR on SATA disk (CAM emulated), ICH9.

Since it works for me, you'll need to provide more information.  Can you
at least drop into kdb to determine the likely source of the hang by
getting a stack trace of all processes to see where they are sleeping
and dumping lock information?
   
   I drop into DDB and put 'bt' console photo in the very first message of 
   this thread - nothing unusual seen in the main stack. Could you please 
   specify exact DDB commands you want to be issued by me? No dump can be 
   provided since nothing is mounted yet including swap,
   
   BTW, I remember I saw previously unseen warnings with post Jun 14 kernels:
   xpt_action_default: CCB type 0xe not supported
   
   'ps' inside DDB shows [xpt_thrd] at ccb_scan wmesg state and [g_event]
   at caplck wmesg state, [kernel] at g_waitid state.
   Even don't know, if it matters.
  
  Just in case, please try r223277.
 
 As the second message in the thread states, I try first even 223296 with 
 the same hang and the same 
 xpt_action_default: CCB type 0xe not supported
 As I think, DDB's 'ps' indicates that kernel waits something from geom and 
 geom waits something from ccb_scan forever, just raw guess. I will be glad to 
 issue more specific DDB commands and upload corresponding photos.
 BTW, pluging and unplugging USB devides works in that stage.

Can you do the following when the hang happens:

ps
alltrace
show locks
show msgbuf

Hopefully that will give us something to start looking at...

This would really work a lot better if there is any way to get a serial
console on the machine.  The above will produce a good bit of output, and
would likely need a lot of pictures.

Since we can't reproduce the problem here, some debugging help would be
greatly appreciated.

Thanks,

Ken
-- 
Kenneth Merry
k...@freebsd.org
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-21 Thread Will Andrews
Hi Andrey,

On Mon, Jun 20, 2011 at 5:46 AM, Andrey Chernov a...@freebsd.org wrote:
 As the second message in the thread states, I try first even 223296 with
 the same hang and the same
 xpt_action_default: CCB type 0xe not supported
 As I think, DDB's 'ps' indicates that kernel waits something from geom and
 geom waits something from ccb_scan forever, just raw guess. I will be glad to
 issue more specific DDB commands and upload corresponding photos.
 BTW, pluging and unplugging USB devides works in that stage.

Please try this patch:
http://people.freebsd.org/~will/patches/ata_xpt_add_advinfo.0.diff

Thanks.
--Will.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-21 Thread George Kontostanos
After applying the patch the system does not boot anymore!

It hangs after probing for scsi devices.

On Tue, Jun 21, 2011 at 8:58 PM, Will Andrews w...@firepipe.net wrote:

 Hi Andrey,

 On Mon, Jun 20, 2011 at 5:46 AM, Andrey Chernov a...@freebsd.org wrote:
  As the second message in the thread states, I try first even 223296 with
  the same hang and the same
  xpt_action_default: CCB type 0xe not supported
  As I think, DDB's 'ps' indicates that kernel waits something from geom
 and
  geom waits something from ccb_scan forever, just raw guess. I will be
 glad to
  issue more specific DDB commands and upload corresponding photos.
  BTW, pluging and unplugging USB devides works in that stage.

 Please try this patch:
 http://people.freebsd.org/~will/patches/ata_xpt_add_advinfo.0.diff

 Thanks.
 --Will.
 ___
 freebsd-current@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-current
 To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org




-- 
George Kontostanos
aisecure.net http://www.aisecure.net
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-21 Thread Andrey Chernov
On Tue, Jun 21, 2011 at 11:58:17AM -0600, Will Andrews wrote:
 Hi Andrey,
 
 On Mon, Jun 20, 2011 at 5:46 AM, Andrey Chernov a...@freebsd.org wrote:
  As the second message in the thread states, I try first even 223296 with
  the same hang and the same
  xpt_action_default: CCB type 0xe not supported
  As I think, DDB's 'ps' indicates that kernel waits something from geom and
  geom waits something from ccb_scan forever, just raw guess. I will be glad 
  to
  issue more specific DDB commands and upload corresponding photos.
  BTW, pluging and unplugging USB devides works in that stage.
 
 Please try this patch:
 http://people.freebsd.org/~will/patches/ata_xpt_add_advinfo.0.diff

The patch does not fix the hang I talk about, but diagnostic is removed.

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-21 Thread Andrey Chernov
On Tue, Jun 21, 2011 at 10:17:19AM -0600, Kenneth D. Merry wrote:
 ps
 alltrace
 show locks
 show msgbuf
 
 Hopefully that will give us something to start looking at...
 
 This would really work a lot better if there is any way to get a serial
 console on the machine.  The above will produce a good bit of output, and
 would likely need a lot of pictures.
 
 Since we can't reproduce the problem here, some debugging help would be
 greatly appreciated.

Sorry I have no serial console. Here are the photos. I remove very similar 
looking USB parts from 'ps' and 'alltrace', and very general parts from 
'alltrace' always been there. I hope remaining info will be enough. USB 
hotplagging works at this stage, so no reason to look there. If it will be 
not enough, I'll upload whole series.

'show lock' outputs nothing, it means no locks just sleep somewhere 
forever.

'ps':
http://img43.imageshack.us/img43/1424/21062011001j.jpg
http://img835.imageshack.us/img835/6607/21062011002.jpg
http://img841.imageshack.us/img841/5401/21062011003.jpg

'alltrace':
http://img864.imageshack.us/img864/6757/21062011004ya.jpg
http://img542.imageshack.us/img542/4857/21062011005.jpg
http://img828.imageshack.us/img828/823/21062011006.jpg
http://img5.imageshack.us/img5/910/21062011007.jpg
http://img7.imageshack.us/img7/4704/21062011008.jpg
http://img848.imageshack.us/img848/5487/21062011009.jpg
http://img641.imageshack.us/img641/2/21062011010.jpg
http://img7.imageshack.us/img7/7946/21062011011.jpg
http://img860.imageshack.us/img860/8185/21062011012.jpg
http://img696.imageshack.us/img696/5276/21062011013.jpg
http://img825.imageshack.us/img825/1249/21062011014m.jpg
http://img839.imageshack.us/img839/3791/21062011015.jpg
http://img594.imageshack.us/img594/1773/21062011016.jpg
http://img109.imageshack.us/img109/9937/21062011017.jpg
http://img51.imageshack.us/img51/6047/21062011018l.jpg

'show msgbuf':
http://img59.imageshack.us/img59/46/21062011019.jpg
http://img189.imageshack.us/img189/483/21062011020.jpg
http://img19.imageshack.us/img19/8163/21062011021.jpg
http://img683.imageshack.us/img683/3171/21062011022.jpg
http://img819.imageshack.us/img819/5923/21062011023.jpg
http://img692.imageshack.us/img692/3789/21062011024.jpg
http://img580.imageshack.us/img580/1550/21062011025.jpg
http://img560.imageshack.us/img560/7478/21062011026.jpg
http://img94.imageshack.us/img94/9371/21062011027.jpg
http://img857.imageshack.us/img857/5185/21062011028.jpg


-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-20 Thread Andrey Chernov
On Sun, Jun 19, 2011 at 08:15:43PM -0600, Justin T. Gibbs wrote:
 On 6/19/11 6:19 PM, Andrey Chernov wrote:
  Exactly that commit is responsible for boot hang.
  Please fix.
  BTW, I have MBR on SATA disk (CAM emulated), ICH9.
 
 Since it works for me, you'll need to provide more information.  Can you
 at least drop into kdb to determine the likely source of the hang by
 getting a stack trace of all processes to see where they are sleeping
 and dumping lock information?

I drop into DDB and put 'bt' console photo in the very first message of 
this thread - nothing unusual seen in the main stack. Could you please 
specify exact DDB commands you want to be issued by me? No dump can be 
provided since nothing is mounted yet including swap,

BTW, I remember I saw previously unseen warnings with post Jun 14 kernels:
xpt_action_default: CCB type 0xe not supported

'ps' inside DDB shows [xpt_thrd] at ccb_scan wmesg state and [g_event]
at caplck wmesg state, [kernel] at g_waitid state.
Even don't know, if it matters.

-- 
http://ache.vniz.net/
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-20 Thread Kostik Belousov
On Mon, Jun 20, 2011 at 11:02:22AM +0400, Andrey Chernov wrote:
 On Sun, Jun 19, 2011 at 08:15:43PM -0600, Justin T. Gibbs wrote:
  On 6/19/11 6:19 PM, Andrey Chernov wrote:
   Exactly that commit is responsible for boot hang.
   Please fix.
   BTW, I have MBR on SATA disk (CAM emulated), ICH9.
  
  Since it works for me, you'll need to provide more information.  Can you
  at least drop into kdb to determine the likely source of the hang by
  getting a stack trace of all processes to see where they are sleeping
  and dumping lock information?
 
 I drop into DDB and put 'bt' console photo in the very first message of 
 this thread - nothing unusual seen in the main stack. Could you please 
 specify exact DDB commands you want to be issued by me? No dump can be 
 provided since nothing is mounted yet including swap,
 
 BTW, I remember I saw previously unseen warnings with post Jun 14 kernels:
 xpt_action_default: CCB type 0xe not supported
 
 'ps' inside DDB shows [xpt_thrd] at ccb_scan wmesg state and [g_event]
 at caplck wmesg state, [kernel] at g_waitid state.
 Even don't know, if it matters.

Just in case, please try r223277.


pgp7CrzQfsgfV.pgp
Description: PGP signature


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-20 Thread Andrey Chernov
On Mon, Jun 20, 2011 at 11:01:46AM +0300, Kostik Belousov wrote:
 On Mon, Jun 20, 2011 at 11:02:22AM +0400, Andrey Chernov wrote:
  On Sun, Jun 19, 2011 at 08:15:43PM -0600, Justin T. Gibbs wrote:
   On 6/19/11 6:19 PM, Andrey Chernov wrote:
Exactly that commit is responsible for boot hang.
Please fix.
BTW, I have MBR on SATA disk (CAM emulated), ICH9.
   
   Since it works for me, you'll need to provide more information.  Can you
   at least drop into kdb to determine the likely source of the hang by
   getting a stack trace of all processes to see where they are sleeping
   and dumping lock information?
  
  I drop into DDB and put 'bt' console photo in the very first message of 
  this thread - nothing unusual seen in the main stack. Could you please 
  specify exact DDB commands you want to be issued by me? No dump can be 
  provided since nothing is mounted yet including swap,
  
  BTW, I remember I saw previously unseen warnings with post Jun 14 kernels:
  xpt_action_default: CCB type 0xe not supported
  
  'ps' inside DDB shows [xpt_thrd] at ccb_scan wmesg state and [g_event]
  at caplck wmesg state, [kernel] at g_waitid state.
  Even don't know, if it matters.
 
 Just in case, please try r223277.

As the second message in the thread states, I try first even 223296 with 
the same hang and the same 
xpt_action_default: CCB type 0xe not supported
As I think, DDB's 'ps' indicates that kernel waits something from geom and 
geom waits something from ccb_scan forever, just raw guess. I will be glad to 
issue more specific DDB commands and upload corresponding photos.
BTW, pluging and unplugging USB devides works in that stage.

-- 
http://ache.vniz.net/


pgpUHquEcbUw3.pgp
Description: PGP signature


Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

2011-06-19 Thread Justin T. Gibbs

On 6/19/11 6:19 PM, Andrey Chernov wrote:

Exactly that commit is responsible for boot hang.
Please fix.
BTW, I have MBR on SATA disk (CAM emulated), ICH9.


Since it works for me, you'll need to provide more information.  Can you
at least drop into kdb to determine the likely source of the hang by
getting a stack trace of all processes to see where they are sleeping
and dumping lock information?

Thanks,
Justin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org