Re: Nightly disk-related panic since upgrade to 10.3

2016-10-21 Thread Mark Linimon
On Fri, Oct 21, 2016 at 10:14:26AM +0200, Andrea Venturoli wrote:
> I've tried this way, but although I'm quite proficient with [k]gdb I tend to
> get lost in FreeBSD's kernel's source code, which, unfortunately, I'm not
> familiar with.
> 
> BTW, I had read that book years ago; I searched for it now, but a 2005
> edition still comes up. Has it ever been updated?

My usual go-to documentation is John Baldwin's paper:

http://www.bsdcan.org/2008/schedule/attachments/45_article.pdf

mcl
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Nightly disk-related panic since upgrade to 10.3

2016-10-21 Thread Andrea Venturoli

On 10/20/16 22:12, Peter wrote:

Hello.

> Basically You have two options: A) fire up kgdb, go into the code and
> try and understand what exactly is happening. This depends on
> whether You have clue enough to go that way; I found "man 4 gdb" and
> especially the "Debugging Kernel Problems" pdf by Greg Lehey quite
> helpful.

I've tried this way, but although I'm quite proficient with [k]gdb I tend
to get lost in FreeBSD's kernel's source code, which, unfortunately, I'm
not familiar with.

BTW, I had read that book years ago; I searched for it now, but a 2005
edition still comes up. Has it ever been updated?

> B) systematically change parameters. Start by figuring out from the logs
> the exact time of the crash and what was happening then, and try to
> reproduce that. Then change things and isolate the cause.

Again, I already tried, but without luck.

Since I had one hang one night during the creation of a snapshot, 
yesterday I tried creating/deleting around 40 of them: I hoped to get 
the system to hang again, but it all worked perfectly.


Since backups are run at night (possibly at the time of the hangs/panics 
and doing snapshots), I launched several backup jobs, but they all 
worked perfectly.


I checked that at the times of the panics there is usually no cron job,
periodic job or whatever running. At least nothing I could identify.

A periodic job was in fact running once, but that's not the rule.
"ps -axl -M /var/crash/vmcore.x" showed nothing unusual.




 bye & Thanks
av.


Re: Nightly disk-related panic since upgrade to 10.3

2016-10-20 Thread Peter

Andrea Venturoli wrote:

> Hello.
>
> Last week I upgraded a 9.3/amd64 box to 10.3: since then, it has crashed
> and rebooted at least once every night.

Hi,

  I have a quite similar issue, crash dumps every night, but my
stacktrace is different (crashing mostly in cam/scsi/scsi.c), and my
env is also quite different (old i386, individual disks, extensive use
of ZFS), so the cause here is very likely different. Also, here the
upgrade is not the only change: I also replaced a burnt power supply
recently and added an SSD cache.

Basically You have two options: A) fire up kgdb, go into the code and
try and understand what exactly is happening. This depends on
whether You have clue enough to go that way; I found "man 4 gdb" and
especially the "Debugging Kernel Problems" pdf by Greg Lehey quite
helpful.
B) systematically change parameters. Start by figuring out from the logs
the exact time of the crash and what was happening then, and try to
reproduce that. Then change things and isolate the cause.
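
As a rough illustration of the first step of option B, here is a minimal
Python sketch that buckets crash times by hour of day, so that a nightly
pattern (and whatever cron or periodic job runs in that hour) stands out.
The timestamps and the function name crashes_by_hour are made up for the
example; the real times would have to be collected by hand from the logs or
the crash dump info.

```python
from collections import Counter
from datetime import datetime


def crashes_by_hour(timestamps):
    """Count crashes per hour of day from 'YYYY-MM-DD HH:MM' strings."""
    hours = Counter()
    for ts in timestamps:
        hours[datetime.strptime(ts, "%Y-%m-%d %H:%M").hour] += 1
    return hours


# Hypothetical crash times, for illustration only (not taken from this thread).
EXAMPLE = ["2016-10-15 03:12", "2016-10-16 03:40", "2016-10-17 01:05"]

if __name__ == "__main__":
    # Print one line per hour that saw at least one crash.
    for hour, count in sorted(crashes_by_hour(EXAMPLE).items()):
        print(f"{hour:02d}:00  {count} crash(es)")
```

If most crashes fall into the same hour, that hour is the place to look for
the triggering job.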

Having a RAID controller is a bit ugly in this regard, as it is more
or less a black box, and it is difficult to change parameters or swap
components.


> The only exception was on Friday, when it locked without rebooting: it
> still answered ping requests and logins through HTTP would half work; I'm
> under the impression that the disk subsystem was hung, so ICMP would
> work since it does no I/O and HTTP too worked as long as no disk access
> was required.


Yep. That tends to happen. It doesn't give much of a clue, except that
there is a disk-related problem.



Nightly disk-related panic since upgrade to 10.3

2016-10-19 Thread Andrea Venturoli

Hello.

Last week I upgraded a 9.3/amd64 box to 10.3: since then, it has crashed
and rebooted at least once every night.

The only exception was on Friday, when it locked without rebooting: it
still answered ping requests and logins through HTTP would half work; I'm
under the impression that the disk subsystem was hung, so ICMP would
work since it does no I/O and HTTP too worked as long as no disk access
was required.


Today I was able to get a couple of (almost identical) dumps:


cpuid = 1
KDB: stack backtrace:
#0 0x804ee170 at kdb_backtrace+0x60
#1 0x804b4576 at vpanic+0x126
#2 0x804b4443 at panic+0x43
#3 0x8068fd2a at softdep_deallocate_dependencies+0x6a
#4 0x805394b5 at brelse+0x145
#5 0x8053793c at bufwrite+0x3c
#6 0x806ae20f at ffs_write+0x3df
#7 0x8076d519 at VOP_WRITE_APV+0x149
#8 0x806ec7c9 at vnode_pager_generic_putpages+0x2a9
#9 0x8076f3b7 at VOP_PUTPAGES_APV+0xa7
#10 0x806ea6f5 at vnode_pager_putpages+0xc5
#11 0x806e17f8 at vm_pageout_flush+0xc8
#12 0x806db432 at vm_object_page_collect_flush+0x182
#13 0x806db1cd at vm_object_page_clean+0x13d
#14 0x806dadbe at vm_object_terminate+0x8e
#15 0x806eac60 at vnode_destroy_vobject+0x90
#16 0x806b4232 at ufs_reclaim+0x22
#17 0x8076e5c7 at VOP_RECLAIM_APV+0xa7
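
For reading a trace like the one above, a minimal Python sketch (not part of
any FreeBSD tooling) can split each frame into number, symbol and offset and
report the first frame that is not part of the panic machinery itself, i.e.
the function that actually called panic(). The names parse_backtrace,
panic_caller and PANIC_MACHINERY are invented for this example, and the
sample is just the first frames of the trace above.

```python
import re

# A frame line looks like: "#3 0x8068fd2a at softdep_deallocate_dependencies+0x6a"
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+)\s+at\s+(\w+)\+(0x[0-9a-fA-F]+)$")

# Frames belonging to the panic path itself, as seen at the top of the trace.
PANIC_MACHINERY = {"kdb_backtrace", "vpanic", "panic"}


def parse_backtrace(text):
    """Return a list of (frame_number, symbol, offset) tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((int(m.group(1)), m.group(3), int(m.group(4), 16)))
    return frames


def panic_caller(frames):
    """Return the first symbol that is not part of the panic machinery."""
    for _, symbol, _ in frames:
        if symbol not in PANIC_MACHINERY:
            return symbol
    return None


SAMPLE = """\
#0 0x804ee170 at kdb_backtrace+0x60
#1 0x804b4576 at vpanic+0x126
#2 0x804b4443 at panic+0x43
#3 0x8068fd2a at softdep_deallocate_dependencies+0x6a
#4 0x805394b5 at brelse+0x145
#5 0x8053793c at bufwrite+0x3c
"""

if __name__ == "__main__":
    print(panic_caller(parse_backtrace(SAMPLE)))
```

On this trace the caller of panic() is softdep_deallocate_dependencies, which
is what points at the soft updates code rather than directly at the disk
driver.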




Does anyone have any better insight into what might be going on?
The disks are all connected to a SAS RAID adapter running on mfi; I
don't think it is a hardware issue, since it had worked perfectly
for years until I did the upgrade; also, mfiutil says everything is OK
and nothing mfi-related is in the logs.




A few ideas come to mind, on which I could use a second opinion:

_ soft updates are broken: that would really surprise me, since I've been
using them for years on this and several other boxes (10.3 too);

_ snapshot creation/deletion is causing this: again, I'm using that
almost everywhere, so I don't think this alone might be the cause;
besides, I've been able to do some dumps without trouble and I don't
think anything was messing with snapshots at the time of the last two
panics;

_ the mfi driver is broken on 10.3: this seems more plausible to me, since
this is the only machine I have it on and it's the only case where I get
these panics.
I found https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=183618, but I
get no "g_vfs_done()..." messages.


Any other hint?



I'd really like to find out what's going on; I'll appreciate any help
and I'm willing to provide any useful info.

On the other hand, this is a production server, so I have to solve this
really soon.
Some ideas come to mind, like disabling soft updates (knowing which file
system was having trouble would help here; is there any way to know?),
trying to enable journaling, upgrading to 10-STABLE, building a kernel with
INVARIANTS/WITNESS/etc..., but I'd appreciate a second opinion before I
start shooting in the dark.




 bye & Thanks
av.