Re: [zfs-discuss] x4500 Thumper panic
Dumping to /dev/dsk/c6t0d0s1 certainly looks like a non-mirrored dump device... You might try a manual savecore, telling it to ignore the dump-valid header, and see what you get: savecore -d. Perhaps also try telling it to look directly at the dump device: savecore -f device.

You should also, when you get the chance, deliberately panic the box to make sure you can actually capture a dump... dumpadm is your friend as far as checking where you are going to dump to, and if it's one side of your swap mirror, that's bad, M'Kay? :)

Nathan.
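To spell those suggestions out (a rough sketch - the dump device path is the one from the report below, and the crash directory is just the usual default):

    # Where will the box dump, and where will savecore write?
    dumpadm

    # Manual savecore, disregarding the dump-valid header flag:
    savecore -d /var/crash/`uname -n`

    # Or point it directly at the dump device:
    savecore -d -f /dev/dsk/c6t0d0s1 /var/crash/`uname -n`

    # And, when you can afford the outage, prove the whole path works
    # with a deliberate panic and dump:
    reboot -d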
Jorgen Lundman wrote:

OK, this is a pretty damn poor panic report if I may say so - no, not had much sleep.

Solaris Express Developer Edition 9/07 snv_70b X86
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 30 August 2007
SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc

Even though it dumped, it wrote nothing to /var/crash/. Perhaps because swap is mirrored.

Jorgen Lundman wrote:

We had a panic around noon on Saturday, from which it mostly recovered by itself. All ZFS NFS exports just remounted, but the UFS on zdev NFS exports did not; they needed a manual umount/mount on all clients for some reason. Is this a known bug we should consider a patch for?

May 10 11:49:46 x4500-01.unix ufs: [ID 912200 kern.notice] quota_ufs: over hard disk limit (pid 477, uid 127409, inum 1047211, fs /export/zero1)
May 10 11:51:26 x4500-01.unix unix: [ID 836849 kern.notice]
May 10 11:51:26 x4500-01.unix ^Mpanic[cpu3]/thread=17b8c820:
May 10 11:51:26 x4500-01.unix genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ff001f4ca220 addr=0 occurred in module unknown due to a NULL pointer dereference
May 10 11:51:26 x4500-01.unix unix: [ID 10 kern.notice]
May 10 11:51:26 x4500-01.unix unix: [ID 839527 kern.notice] nfsd:
May 10 11:51:26 x4500-01.unix unix: [ID 753105 kern.notice] #pf Page fault
May 10 11:51:26 x4500-01.unix unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x0
May 10 11:51:26 x4500-01.unix unix: [ID 243837 kern.notice] pid=477, pc=0x0, sp=0xff001f4ca318, eflags=0x10246
May 10 11:51:26 x4500-01.unix unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
May 10 11:51:26 x4500-01.unix unix: [ID 354241 kern.notice] cr2: 0 cr3: 1fcbbc000 cr8: c
May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rdi: fffedefea000 rsi: 9 rdx: 0
May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rcx: 17b8c820 r8: 0 r9: ff054797dc48
May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rax: 0 rbx: 97eaffc rbp: ff001f4ca350
May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] r10: 0 r11: fffec8b93868 r12: 27991000
May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] r13: fffed1b59c00 r14: fffecf8d8cc0 r15: 1000
May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] fsb: 0 gsb: fffec3d5a580 ds: 4b
May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] es: 4b fs: 0 gs: 1c3
May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] trp: e err: 10 rip: 0
May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] cs: 30 rfl: 10246 rsp: ff001f4ca318
May 10 11:51:27 x4500-01.unix unix: [ID 266532 kern.notice] ss: 38
May 10 11:51:27 x4500-01.unix unix: [ID 10 kern.notice]
May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca100 unix:die+c8 ()
May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca210 unix:trap+135b ()
May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca220 unix:_cmntrap+e9 ()
May 10 11:51:27 x4500-01.unix genunix: [ID 802836 kern.notice] ff001f4ca350 0 ()
May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca3d0 ufs:top_end_sync+cb ()
May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca440 ufs:ufs_fsync+1cb ()
May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca490 genunix:fop_fsync+51 ()
May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca770 nfssrv:rfs3_create+604 ()
May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4caa70 nfssrv:common_dispatch+444 ()
May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4caa90 nfssrv:rfs_dispatch+2d ()
May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4cab80 rpcmod:svc_getreq+1c6 ()
May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4cabf0
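For completeness: once a dump does land in /var/crash, a quick first look with mdb answers the "what died, and where" question. A rough sketch, assuming savecore wrote unix.0 / vmcore.0 into the default directory:

    # From the directory savecore wrote into:
    cd /var/crash/`uname -n`
    mdb -k unix.0 vmcore.0

    # Then, at the mdb prompt, the usual first three questions:
    #   ::status    - panic string and dump summary
    #   ::stack     - stack trace of the panicking thread
    #   ::msgbuf    - kernel message buffer leading up to the panic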
Re: [zfs-discuss] zfs data corruption
Note: IANATZD (I Am Not A Team-ZFS Dude)

Speaking as a Hardware Guy, knowing that something is happening, has happened, or is indicated to happen is a Good Thing (tm).

Begin unlikely, but possible, scenario: if, for instance, I'm getting a cluster of read errors (or perhaps bad blocks), I could:

- See it as it's happening
- See the block number for each error
- Already know the rate at which the errors are happening
- Be able to determine that it's not good, and it's time to replace the disk
- You get the picture...

And based on this information, I could feel confident that I have the right information at hand to determine whether or not it is time to replace this disk. Of course, that assumes:

- I know anything about disks
- I know anything about the error messages
- I have some sort of logging tool that recognises the errors (and does not just throw out the 'retryable' ones, as most I have seen are configured to do)
- I care
- The folks watching the logs in the enterprise management tool care
- My storage even bothers to report the errors

Certainly, for some organisations, all of the above are exactly how it works, and it works well for them.

Looking at the ZFS/FMA approach, it certainly is somewhat different. The (very) rough concept is that FMA gets pretty much all errors reported to it. It logs them in a persistent store, which is always available to view. It also makes diagnoses on the errors, based on the rules that exist for that particular style of error. Once enough (or the right type of) errors happen, it will make a Fault Diagnosis for that component and log a message, loud and proud, into the syslog. It may also take other actions, like retiring a page of memory, offlining a CPU, panicking the box, etc.

So - that's the rough overview. It's worth noting up front that we can *observe* every event that has happened. Using fmdump and fmstat we can immediately see if anything interesting has been happening, or we can wait for a Fault Diagnosis, in which case we can just watch /var/adm/messages. I also *believe* (though am not certain - perhaps someone else on the list might be?) that it would be possible to have each *event* (that is, the individual events that lead to a Fault Diagnosis) generate a message if required, though I have never taken the time to do that one...

There are many advantages to this approach - it does not rely on logfiles, offsets into logfiles, counters of previously processed messages, and all of the other doom and gloom that comes with scraping logfiles. It's something you can simply ask: any issues, chief? The answer is there in a flash. You are also less likely to have the messages rolled out of the logs before you get to them (another classic...). And you get some great details from fmdump showing you what's really going on, in a form that's really easy to parse when looking for patterns.

All of this said, I understand that if you feel things are being 'hidden' from you until they're *actually* busted, you are having some of your forward vision obscured 'in the name of a quiet logfile'. I felt much the same way for a period of time. (Though I live more in the CPU / Memory camp...) But once I realised what I could do with fmstat and fmdump, I was not the slightest bit unhappy (actually, that's not quite true... even once I knew what they could do, it still took me a while to work out the options I cared about for fmdump / fmstat), and I now trust FMA to look after my CPU / Memory issues better than I would in real life.
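To make that concrete, here's roughly what I run (stock FMA tools, nothing exotic; check the man pages on your build for the exact options):

    # One-screen health check: per-module event and diagnosis counters.
    fmstat

    # Committed diagnoses: the fault log, and what's currently faulty.
    fmdump
    fmadm faulty

    # The raw telemetry (ereports) behind those diagnoses; -V dumps the
    # full detail, which is handy when hunting for patterns.
    fmdump -e
    fmdump -eV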
I can still get what I need when I want to, and the data is actually more accessible and interesting. I just needed to know where to go looking.

All this being said, I was not actually aware that many of our disk / target drivers were actually FMA'd up yet. Heh - shows what I know.

Does any of this make you feel any better (or worse)?

Nathan.

Mark A. Carlson wrote:

fmd(1M) can log already-diagnosed faults to syslogd. Why would you want the random spew as well?

-- mark

Carson Gaspar wrote:

[EMAIL PROTECTED] wrote:

It's not safe to jump to this conclusion. Disk drivers that support FMA won't log error messages to /var/adm/messages. As more support for I/O FMA shows up, you won't see random spew in the messages file any more.

mode=large financial institution paying support customer

That is a Very Bad Idea. Please convey this to whoever thinks that they're helping by not syslogging I/O errors. If this shows up in Solaris 11, we will Not Be Amused. Lack of off-box error logging will directly cause loss of revenue.

/mode
Re: [zfs-discuss] ZFS vs. Novell NSS
Hm - based on this detail from the page:

    "Change lever for switching between Rotation + Hammering, Neutral and Hammering only"

I'd hope it could still hammer... though I'd suspect the size of nails it would hammer would be somewhat limited... ;)

Nathan.

Boyd Adamson wrote:

Richard Elling [EMAIL PROTECTED] writes:

Tim wrote: The greatest hammer in the world will be inferior to a drill when driving a screw :)

The greatest hammer in the world is a rotary hammer, and it works quite well for driving screws or digging through degenerate granite ;-) Need a better analogy. Here's what I use (quite often) on the ranch: http://www.hitachi-koki.com/powertools/products/hammer/dh40mr/dh40mr.html

Hasn't the greatest hammer in the world lost the ability to drive nails? I'll have to start belting them in with the handle of a screwdriver...
Re: [zfs-discuss] ZFS, iSCSI + Mac OS X Tiger (globalSAN iSCSI)
Hey there - this is very likely completely unrelated, but here goes anyhoo...

I have noticed with some particular ethernet adapters (e1000g in my case) and large MTU sizes (8K) that things (most anything that really pushes the interface) sometimes stop for no good reason on my x86 Solaris boxes. After it stops, I'm able to re-connect after a short time and it works for a while again... (Really must get around to properly reproducing the problem and logging a bug too...)

I'd be curious to know if setting the MTU to 1500 on both systems makes any difference at all; a sketch of what I mean is at the bottom of this message. Note that I have only observed this with my super cheap adapters at home. I'm yet to see it (though am also yet to try really hard) on the more expensive ones at work...

Again - likely nothing to do with your problem, but hey. It has made a difference for me before... Cheers.

Nathan.

George wrote:

I have set up an iSCSI ZFS target that seems to connect properly from the Microsoft Windows initiator, in that I can see the volume in MMC Disk Management. When I shift over to Mac OS X Tiger with globalSAN iSCSI, I am able to set up the Targets with the target name shown by `iscsitadm list target`, and when I actually connect or Log On I see that one connection exists on the Solaris server. I then go on to the Sessions tab in globalSAN and I see the session details, and it appears that data is being transferred, going by the PDUs Sent, PDUs Received, Bytes, etc.

HOWEVER, the connection then appears to terminate on the Solaris side: if I check it a few minutes later it shows no connections, but the Mac OS X initiator still shows connected, although no more traffic appears to be flowing in the Session Statistics dialog area.

Additionally, when I then disconnect the Mac OS X initiator it seems to drop fine on the Mac OS X side, even though the Solaris side has shown it gone for a while. However, when I reconnect or Log On again, it seems to spin infinitely on the "Target Connect..." dialog. Solaris is, interestingly, showing 1 connection while this apparent issue (spinning beachball of death) is going on with globalSAN. Even killing the Mac OS X process doesn't seem to get me full control again, as I have to restart the system to kill all processes (unless I can hunt them down and `kill -9` them, which I've not successfully done thus far).

Has anyone dealt with this before who might be able to assist, or at least throw some further information towards me to troubleshoot this?

Thanks much,
-George
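The MTU experiment I mentioned up top, spelled out (a rough sketch - e1000g0 is just an example interface name, so substitute your own; and note that the jumbo-frame side of e1000g is normally set via MaxFrameSize in /kernel/drv/e1000g.conf rather than at runtime):

    # What MTU is the interface running right now?
    ifconfig e1000g0

    # Drop the IP MTU back to the standard 1500 for the test
    # (do this on both ends, then see if the iSCSI session still stalls):
    ifconfig e1000g0 mtu 1500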