Re: msk msk0 watchdog timeout freeze hang lock stop problem
On Wed, Apr 15, 2015 at 09:52:09PM, Gareth Wyn Roberts wrote:
> I've inserted code to print some values which show the differences
> between specifying 4096 or 8192 for MSK_STAT_ALIGN. In both cases the
> status buffer has length 0x4000 (8x2048=16K), but the alignments differ
> as expected: the start addresses are 0x5c3b000 and 0xbdc2c000
> respectively.
>
> The following values were output from the functions
> msk_status_dma_alloc(), msk_dmamap_cb() and msk_handle_events(). The
> "Break #n" entries refer to breaks in msk_handle_events(): #1 occurs if
> ((control & HW_OWNER) == 0), #5 is OP_RXSTAT and #6 is OP_TXINDEXLE.
>
> The first output is for MSK_STAT_ALIGN=8192. It continues normally.
> Although not shown here, it reaches cons=2047 and then cons=0 as
> expected.
>
> The second output is for MSK_STAT_ALIGN=4096. Although there can be
> isolated occurrences of Break #1 (e.g. at cons=196; are these to be
> expected?), it continues normally until cons=512. At this point it
> continually takes the #1 branch because msk_control from
> msk_stat_ring[512] is always zero, and the network hangs immediately.
> This suggests the Yukon Ultra 2 88E8057 can't access the next
> 4096-byte memory block, but why not?

Yes, it seems the status LE block is not updated at all for
MSK_STAT_ALIGN == 4096, and some elements of the status block look
suspicious (the put index increases but the value at that location is 0).
My vague guess is that this indicates DMA alignment and/or DMA boundary
issues.

The maximum number of elements in the status block is 4096, so the
maximum size of the status block is 32KB. For i386, msk(4) uses an 8KB
status block (1024 elements). For 64-bit architectures, the block size
is increased to 16KB (2048 elements). The safe alignment value for the
status block would probably be 32KB. This looks like an excessive value
to me, but it should avoid having to guess at DMA boundary issues.

> Please let me know if any further information would be helpful.

Thanks a lot. I've attached a diff which sets the alignment of the TX/RX
rings and the status block to 32KB. I'm not sure whether this also
addresses other msk(4)-related watchdog timeouts.

Index: sys/dev/msk/if_mskreg.h
===================================================================
--- sys/dev/msk/if_mskreg.h	(revision 281587)
+++ sys/dev/msk/if_mskreg.h	(working copy)
@@ -2175,13 +2175,8 @@
 #define MSK_ADDR_LO(x)	((uint64_t) (x) & 0xffffffffUL)
 #define MSK_ADDR_HI(x)	((uint64_t) (x) >> 32)
 
-/*
- * At first I guessed 8 bytes, the size of a single descriptor, would be
- * required alignment constraints. But, it seems that Yukon II have 4096
- * bytes boundary alignment constraints.
- */
-#define MSK_RING_ALIGN	4096
-#define MSK_STAT_ALIGN	4096
+#define MSK_RING_ALIGN	32768
+#define MSK_STAT_ALIGN	32768
 
 /* Rx descriptor data structure */
 struct msk_rx_desc {

_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
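The alignment arithmetic in the msk(4) thread above can be checked with a
small sketch. This is not driver code: the helper names (is_aligned,
crosses_boundary, stat_ring_bytes) are hypothetical, and the 8-byte element
size is an assumption derived from the figures quoted in the thread (2048
elements = 16KB). It only illustrates the numbers; it does not explain why
the hardware stops updating the ring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed size of one status list element: 16KB / 2048 elements. */
#define STAT_LE_SIZE 8u

/* True if x is a non-zero power of two. */
static bool is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* True if addr is a multiple of align (align must be a power of two). */
static bool is_aligned(uint64_t addr, uint64_t align)
{
    assert(is_pow2(align));
    return (addr & (align - 1)) == 0;
}

/*
 * True if the buffer [addr, addr + len) touches more than one naturally
 * aligned window of `window` bytes.
 */
static bool crosses_boundary(uint64_t addr, uint64_t len, uint64_t window)
{
    assert(is_pow2(window) && len > 0);
    return (addr / window) != ((addr + len - 1) / window);
}

/* Size in bytes of a status ring holding `count` elements. */
static uint64_t stat_ring_bytes(uint64_t count)
{
    return count * STAT_LE_SIZE;
}
```

With the addresses quoted above, is_aligned(0x5c3b000, 8192) is false while
is_aligned(0xbdc2c000, 8192) is true, and stat_ring_bytes(512) is 4096, so
element 512 (where the hang was observed) is the first element on the second
4096-byte page of the buffer. A 16KB ring placed on a 32KB-aligned address
never crosses a 32KB boundary, which is the property the proposed diff
relies on.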
Re: [GEOM] Disk IO error when resyncing gmirror - massive hang in D state
Walter,

thanks for your suggestions. To answer quickly: I've already evacuated
the data to the new drive (see the last paragraph of my original
message). Luckily, no critical data was on the failed part of the disk,
so rsync finished fine on the very first pass.

The only question still open for me is why the kernel got stuck in GEOM
instead of returning read/write errors to the applications.

I'll try to put together a lab machine with this drive (which is still
by my work table) and reproduce the error.

On Wed, 15 Apr 2015, Walter Cramer wrote:

> Here are a few ideas I had, if more capable people have not already
> sent you better ones:
>
> Copy as much important data as possible from the Toshiba drive, since
> it could degrade further or die at any time.
>
> Check whether a 'dd' command can quickly reproduce the error, so you
> can try things faster.
>
> If the failing drive is not fairly cold, try chilling it with a strong
> fan.
>
> Briefly put the drive in another system, to see if using a different
> power supply, controller, data cable, etc. would help. Changing the
> orientation (direction of gravity on the drive) might also be good.
>
> If nothing else helped, a tiny C language program might use open(),
> read(), lseek(), write(), etc. to copy all readable sectors to your
> replacement disk (using zeros for the unreadable bad sectors).
>
> -Walter
>
> On Tue, 14 Apr 2015, Dmitry Morozovsky wrote:
>
> > Dear colleagues,
> >
> > unfortunately, the machine in question is in production, so I have
> > no clear reproduction case. I do have console logs, however.
> >
> > prerequisites:
> >
> > - rather fresh stable/10, amd64, SuperMicro MicroCloud 1150,
> >   X10SLD-F/HF
> > - su+j ufs2 on top of a gmirror of two SATA Toshiba drives
> > - one disk died some time ago, so gmirror works in degraded state
> >
> > trouble:
> >
> > - inserted new drive, labelled, started gmirror resync
> > - apparently the remaining drive also has read issues:
> >
> > (ada0:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 10 b2 c3 40 01 00 00 01 00 00
> > (ada0:ahcich1:0:0:0): CAM status: ATA Status Error
> > (ada0:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
> > (ada0:ahcich1:0:0:0): RES: 41 40 04 b3 c3 40 01 00 00 00 01
> > (ada0:ahcich1:0:0:0): Error 5, Retries exhausted
> > GEOM_MIRROR: Request failed (error=5). ada0a[READ(offset=6566445056, length=131072)]
> > GEOM_MIRROR: Synchronization request failed (error=5). mirror/m0a[READ(offset=6566445056, length=131072)]
> >
> > At this point, all disk I/O requests are stalled: all cron jobs,
> > syslogd, dhcpd, etc.
> >
> > The situation reproduced itself at least twice; then, as an
> > emergency measure, the new drive was labelled independently and the
> > data was rsynced over.
> >
> > Any thoughts? Thanks in advance!

-- 
Sincerely,
D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
[ FreeBSD committer: ma...@freebsd.org ]
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru ***
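Walter's last suggestion in the thread above (copy every readable chunk and
write zeros for the unreadable ones) could look roughly like the sketch
below. It is a minimal illustration, not a tested recovery tool: the
function names rescue_copy and write_all and the choice of chunk size are
hypothetical. It zero-fills a whole chunk when a read fails, so a small
chunk size (a multiple of the sector size on a raw device) loses less data.
The existing tools dd(1) with conv=noerror,sync and recoverdisk(1) cover the
same ground.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Copy `total` bytes from in_fd to out_fd in chunks of `blk` bytes.
 * A chunk whose seek or read fails is written out as zeros, and any part
 * of a chunk that a short read did not return is zero-filled as well.
 * Returns the number of failed chunks, or -1 if the output side
 * (allocation, seek or write) fails.
 */
static long rescue_copy(int in_fd, int out_fd, off_t total, size_t blk)
{
    assert(blk > 0);
    char *buf = malloc(blk);
    if (buf == NULL)
        return -1;

    long bad = 0;
    for (off_t off = 0; off < total; off += (off_t)blk) {
        size_t want = (total - off < (off_t)blk) ? (size_t)(total - off) : blk;
        ssize_t got = -1;

        /* Seek explicitly so a failed read cannot leave us misaligned. */
        if (lseek(in_fd, off, SEEK_SET) == off)
            got = read(in_fd, buf, want);
        if (got < 0) {
            got = 0;
            bad++;
        }
        /* Zero-fill whatever was not read. */
        memset(buf + got, 0, want - (size_t)got);

        if (lseek(out_fd, off, SEEK_SET) != off ||
            write_all(out_fd, buf, want) != 0) {
            free(buf);
            return -1;
        }
    }
    free(buf);
    return bad;
}
```

A caller would open the failing provider read-only and the replacement
read-write, pass the size of the source as `total`, and inspect the return
value to see how many chunks had to be zero-filled.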
auditd zombies in 10.1?
I notice that systems of ours which were recently upgraded to 10.1 are
accumulating zombies at an alarming rate. (Well, alarming enough to
cause me to be paged at 4 in the morning, at any rate!) These zombies
are all children of auditd. Has anyone else seen or debugged this?

-GAWollman
Re: freebsd-update and hang during reboot
On Wed, Apr 15, 2015 at 02:44:44PM -0700, Nick Rogers wrote:
> On Mon, Mar 9, 2015 at 9:19 AM, Nick Rogers <ncrog...@gmail.com> wrote:
>
> > Is anyone working on fixing this problem? It seems like this should
> > have some kind of full court press, as it is obviously affecting
> > plenty of people, some of whom have spoken up in the following PR:
> > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195458
> >
> > I realize it's a tough problem to track down, and if I had the
> > appropriate skills I would help. But so far all I've been able to
> > do, like others, is replicate and complain about the problem.
> >
> > It's still affecting upgrades to 10.1-RELEASE-p6 from the official
> > 10.1-RELEASE distribution, and from 10.1-RELEASE-p5. I just had
> > another production server hang during reboot after updating to p6,
> > and I don't see this changing for the inevitable p7 unless this
> > problem gets more attention.
> >
> > Can someone with the right skill set please help figure this out?
> > Thank you.
>
> In case anyone is still dealing with this problem, the fix was MFC'd
> to stable/10 a few days ago. I am assuming this will not end up
> getting backported to releng/10.1.

An EN for 10.1-RELEASE is planned.

Glen