Re: msk msk0 watchdog timeout freeze hang lock stop problem

2015-04-16 Thread Yonghyeon PYUN
On Wed, Apr 15, 2015 at 09:52:09PM +, Gareth Wyn Roberts wrote:
 I've inserted code to print some values which show the differences between 
 specifying 4096 or 8192 for MSK_STAT_ALIGN.  In both cases the status buffer 
 has length 0x4000 (8x2048=16K) but the alignments are different as expected, 
 respectively start addresses 0x5c3b000 or 0xbdc2c000.
 
 The following values were output from functions msk_status_dma_alloc(), 
 msk_dmamap_cb() and msk_handle_events().
 The Break #n refer to breaks in msk_handle_events(). #1 occurs if 
 ((control & HW_OWNER) == 0), #5 is OP_RXSTAT and #6 is OP_TXINDEXLE.
 
 The first output is for MSK_STAT_ALIGN=8192.  It continues normally.  
 Although not shown here, it reaches cons=2047 then cons=0 as expected.
 
 The second output is for MSK_STAT_ALIGN=4096.  Although there can be isolated 
 occurrences of Break #1 (e.g. cons=196) (?are these to be expected?),  it 
 continues normally until cons=512. At this point it continually invokes the 
 #1 block because the msk_control from msk_stat_ring[512] is always zero and 
 the network hangs immediately. This suggests the Yukon Ultra 2 88E8057 can't 
 access the next 4096 memory block, but why not?
 

Yes, it seems the status LE block is not updated at all for
MSK_STAT_ALIGN == 4096, and some elements of the status block look
suspicious (the put index increases but the value at that location
is 0).  My vague guess is that this indicates DMA alignment and/or
DMA boundary issues.
The maximum number of elements in the status block is 4096, so the
maximum size of the status block is 32KB.  For i386, msk(4) uses an
8KB status block (1024 elements).  For 64-bit architectures, the
block size is increased to 16KB (2048 elements).
The safe alignment value for the status block would probably be
32KB.  This looks excessive to me, but it should avoid having to
guess at the DMA boundary issue.
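
As a sanity check on the numbers above, here is a small C sketch of the
arithmetic.  It assumes each status element is 8 bytes (32KB / 4096
elements) and uses the start addresses from the report; the helper names
are made up for illustration and are not part of msk(4).

```c
#include <stdint.h>

/* Assumption: one status element is 8 bytes (32KB / 4096 elements). */
#define STAT_LE_SIZE 8u

/* Size in bytes of a status block holding `count` elements. */
static uint64_t stat_block_bytes(uint64_t count)
{
    return count * STAT_LE_SIZE;
}

/* Largest power-of-two alignment of `addr`, capped at `limit`. */
static uint64_t natural_align(uint64_t addr, uint64_t limit)
{
    uint64_t a = 1;

    while (a < limit && (addr & ((a << 1) - 1)) == 0)
        a <<= 1;
    return a;
}

/*
 * Index of the first element located at the next multiple of
 * `boundary` strictly above `base`.
 */
static uint64_t first_index_past_boundary(uint64_t base, uint64_t boundary)
{
    uint64_t next = (base / boundary + 1) * boundary;

    return (next - base) / STAT_LE_SIZE;
}
```

With base 0x5c3b000, natural_align() gives 4096 and
first_index_past_boundary(base, 8192) gives 512, which matches the cons
value where the hang was reported.  That is consistent with a boundary
issue, though it does not prove one.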

 Please let me know if any further information would be helpful.
 

Thanks a lot. I've attached a diff which sets the alignment of
TX/RX ring and status block to 32KB.  Not sure whether this also
addresses other msk(4) related watchdog timeouts.
Index: sys/dev/msk/if_mskreg.h
===
--- sys/dev/msk/if_mskreg.h	(revision 281587)
+++ sys/dev/msk/if_mskreg.h	(working copy)
@@ -2175,13 +2175,8 @@
 #define MSK_ADDR_LO(x)	((uint64_t) (x) & 0xffffffffUL)
 #define MSK_ADDR_HI(x)	((uint64_t) (x) >> 32)
 
-/*
- * At first I guessed 8 bytes, the size of a single descriptor, would be
- * required alignment constraints. But, it seems that Yukon II have 4096
- * bytes boundary alignment constraints.
- */
-#define MSK_RING_ALIGN	4096
-#define	MSK_STAT_ALIGN	4096
+#define	MSK_RING_ALIGN	32768
+#define	MSK_STAT_ALIGN	32768
 
 /* Rx descriptor data structure */
 struct msk_rx_desc {
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org

Re: [GEOM] Disk IO error when resyncing gmirror - massive hang in D state

2015-04-16 Thread Dmitry Morozovsky
Walter,

thanks for your suggestions.

To answer quickly: I've already evacuated the data to the new drive (see the last 
paragraph of my original message). Luckily, no critical data was on the failed 
part of the disk, so rsync finished cleanly on the very first pass.

The only question still open for me is why the kernel got stuck in 
geom instead of returning read/write errors to the applications.

I'll try to set up a lab machine with this drive (which is still by my work 
table) and reproduce the error.



On Wed, 15 Apr 2015, Walter Cramer wrote:

 Here are a few ideas I had, if more capable people have not already sent you
 better ones:
 
 Copy as much important data as possible from the Toshiba drive, since it could
 degrade further or die at any time.
 
 Check whether a 'dd' command can quickly reproduce the error, so you can try
 things faster.
 
 If the failing drive is not fairly cold, try chilling it with a strong fan.
 
 Briefly put the drive in another system, to see if using a different power
 supply, controller, data cable, etc. would help.  Changing the orientation
 (direction of gravity on the drive) might also be good.
 
 If nothing else helped, a tiny c language program might use open(), read(),
 lseek(), write(), etc. to copy all readable sectors to your replacement disk
 (using zeros for the unreadable bad sectors).
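
 A minimal sketch of such a program's core loop, using only the calls
 named above. The function name rescue_copy and the 512-byte sector size
 are assumptions made for illustration; a real disk may need a different
 sector size, and this is not a tested recovery tool.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR 512 /* assumption: 512-byte sectors */

/*
 * Copy src to dst one sector at a time.  A sector that cannot be read
 * is written out as zeros.  Returns the number of unreadable sectors,
 * or -1 if seeking or writing fails.
 */
static long rescue_copy(int src, int dst)
{
    unsigned char buf[SECTOR];
    off_t off = 0;
    long bad = 0;

    for (;;) {
        /* Reposition explicitly so a failed read cannot leave us adrift. */
        if (lseek(src, off, SEEK_SET) == (off_t)-1)
            return -1;

        ssize_t n = read(src, buf, SECTOR);
        if (n == 0)
            break;              /* end of device or file */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same sector */
            memset(buf, 0, SECTOR);
            n = SECTOR;         /* substitute zeros and move on */
            bad++;
        }

        if (lseek(dst, off, SEEK_SET) == (off_t)-1)
            return -1;
        if (write(dst, buf, (size_t)n) != n)
            return -1;
        off += n;
    }
    return bad;
}
```

 The caller would open() the failing disk read-only and the replacement
 read-write, then pass both descriptors in. Reading one sector at a time
 is slow but keeps each bad sector from taking its neighbours with it.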
 
 -Walter
 
 
 On Tue, 14 Apr 2015, Dmitry Morozovsky wrote:
 
  Dear colleagues,
  
  unfortunately, the machine in question is in production, so I have no clear
  reproduction case. I do have console logs, however.
  
  prerequisites:
  - rather fresh stable/10, amd64, SuperMicro MicroCloud 1150, X10SLD-F/HF
  - su+j ufs2 on top of gmirror of two SATA Toshiba drives
  - one disk died some time ago, so gmirror works in degraded state
  
  trouble:
  - inserted new drive, labelled, started gmirror resync
  - apparently remaining drive also has read issues:
  (ada0:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 10 b2 c3 40 01 00 00 01
  00 00
  (ada0:ahcich1:0:0:0): CAM status: ATA Status Error
  (ada0:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
  (ada0:ahcich1:0:0:0): RES: 41 40 04 b3 c3 40 01 00 00 00 01
  (ada0:ahcich1:0:0:0): Error 5, Retries exhausted
  GEOM_MIRROR: Request failed (error=5). ada0a[READ(offset=6566445056,
  length=131072)]
  GEOM_MIRROR: Synchronization request failed (error=5).
  mirror/m0a[READ(offset=6566445056, length=131072)]
  
  at this point, all disk I/O requests are stalled: all cron jobs, syslogd,
  dhcpd, etc.
  
  The situation reproduced itself at least twice; then, as an emergency
  measure, the new drive was labelled independently and the data rsynced over.
  
  Any thoughts?
  
  Thanks in advance!
  
  
  -- 
  Sincerely,
  D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
  [ FreeBSD committer: ma...@freebsd.org ]
  
  *** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru ***
  
 

-- 
Sincerely,
D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer: ma...@freebsd.org ]

*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru ***



auditd zombies in 10.1?

2015-04-16 Thread Garrett Wollman
I notice that systems of ours which were recently upgraded to 10.1 are
accumulating zombies at an alarming rate.  (Well, alarming enough to
cause me to be paged at 4 in the morning, at any rate!)  These zombies
are all children of auditd.  Has anyone else seen or debugged this?

-GAWollman



Re: freebsd-update and hang during reboot

2015-04-16 Thread Glen Barber
On Wed, Apr 15, 2015 at 02:44:44PM -0700, Nick Rogers wrote:
 On Mon, Mar 9, 2015 at 9:19 AM, Nick Rogers ncrog...@gmail.com wrote:
  Is anyone working on fixing this problem? It seems like this should have
  some kind of full court press as it is obviously affecting plenty of
  people, some of whom have spoken up in the following PR
 
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195458
 
  I realize it's a tough problem to track down, and if I had the appropriate
  skills I would help. But so far all I've been able to do, like others, is
  replicate and complain about the problem.
 
  It's still affecting upgrades to 10.1-RELEASE-p6 from the official
  10.1-RELEASE distribution, and from 10.1-RELEASE-p5. I just had another
  production server hang during reboot after updating to p6, and I don't see
  this changing for the inevitable p7 unless this problem gets more
  attention. Can someone with the right skill-set please help figure this
  out? Thank you.
 
 
 In case anyone is still dealing with this problem, the fix was MFC'd to
 stable/10 a few days ago. I am assuming this will not end up getting
 backported to releng/10.1.

An EN for 10.1-RELEASE is planned.

Glen


