Re: [zfs-discuss] x4500 Thumper panic

2008-05-11 Thread Nathan Kroenert - Server ESG
Dumping to /dev/dsk/c6t0d0s1

certainly looks like a non-mirrored dump dev...

You might try a manual savecore, telling it to disregard the dump-valid 
header, and see what you get...

savecore -d

and perhaps try telling it to look directly at the dump device...

savecore -f device

You should also, when you get the chance, deliberately panic the box to 
make sure you can actually capture a dump...

dumpadm is your friend for checking where you are going to dump 
to, and if it's one side of your swap mirror, that's bad, M'Kay?
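
Something like the following is what I'd poke at first. (A rough sketch 
only - the slice names are made-up examples and the dumpadm output will 
differ in the details on your box; see dumpadm(1M) and savecore(1M).)

  # dumpadm
        Dump content: kernel pages
         Dump device: /dev/dsk/c6t0d0s1 (swap)
  Savecore directory: /var/crash/x4500-01
    Savecore enabled: yes

  # dumpadm -d /dev/dsk/c6t1d0s1      (re-point the dump device; example slice)
  # savecore -d                       (retry, disregarding the dump-valid flag)
  # savecore -f /dev/dsk/c6t0d0s1 /var/crash/x4500-01
                                      (read the dump device directly)
  # reboot -d                         (force a crash dump on the way down, to
                                       prove the whole path actually works)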

:)

Nathan.

Jorgen Lundman wrote:
 OK, this is a pretty damn poor panic report, if I may say so; not had 
 much sleep.
 
  Solaris Express Developer Edition 9/07 snv_70b X86
 Copyright 2007 Sun Microsystems, Inc.  All Rights Reserved.
  Use is subject to license terms.
  Assembled 30 August 2007
 
 SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc
 
 Even though it dumped, it wrote nothing to /var/crash/. Perhaps because 
 swap is mirrored.
 
 
 
 Jorgen Lundman wrote:
 We had a panic around noon on Saturday, from which it mostly recovered 
 by itself. All ZFS NFS exports just remounted, but the UFS on zdev NFS 
 exports did not; they needed a manual umount & mount on all clients for 
 some reason.

 Is this a known bug we should consider a patch for?



 May 10 11:49:46 x4500-01.unix ufs: [ID 912200 kern.notice] quota_ufs: over hard disk limit (pid 477, uid 127409, inum 1047211, fs /export/zero1)
 May 10 11:51:26 x4500-01.unix unix: [ID 836849 kern.notice]
 May 10 11:51:26 x4500-01.unix ^Mpanic[cpu3]/thread=17b8c820:
 May 10 11:51:26 x4500-01.unix genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ff001f4ca220 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
 May 10 11:51:26 x4500-01.unix unix: [ID 10 kern.notice]
 May 10 11:51:26 x4500-01.unix unix: [ID 839527 kern.notice] nfsd:
 May 10 11:51:26 x4500-01.unix unix: [ID 753105 kern.notice] #pf Page fault
 May 10 11:51:26 x4500-01.unix unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x0
 May 10 11:51:26 x4500-01.unix unix: [ID 243837 kern.notice] pid=477, pc=0x0, sp=0xff001f4ca318, eflags=0x10246
 May 10 11:51:26 x4500-01.unix unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
 May 10 11:51:26 x4500-01.unix unix: [ID 354241 kern.notice] cr2: 0 cr3: 1fcbbc000 cr8: c
 May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rdi: fffedefea000 rsi: 9 rdx: 0
 May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rcx: 17b8c820  r8: 0  r9: ff054797dc48
 May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rax: 0 rbx: 97eaffc rbp: ff001f4ca350
 May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] r10: 0 r11: fffec8b93868 r12: 27991000
 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] r13: fffed1b59c00 r14: fffecf8d8cc0 r15: 1000
 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] fsb: 0 gsb: fffec3d5a580  ds: 4b
 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice]  es: 4b  fs: 0  gs: 1c3
 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] trp: e err: 10 rip: 0
 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice]  cs: 30 rfl: 10246 rsp: ff001f4ca318
 May 10 11:51:27 x4500-01.unix unix: [ID 266532 kern.notice]  ss: 38
 May 10 11:51:27 x4500-01.unix unix: [ID 10 kern.notice]
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca100 unix:die+c8 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca210 unix:trap+135b ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca220 unix:_cmntrap+e9 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 802836 kern.notice] ff001f4ca350 0 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca3d0 ufs:top_end_sync+cb ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca440 ufs:ufs_fsync+1cb ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca490 genunix:fop_fsync+51 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca770 nfssrv:rfs3_create+604 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4caa70 nfssrv:common_dispatch+444 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4caa90 nfssrv:rfs_dispatch+2d ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4cab80 rpcmod:svc_getreq+1c6 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4cabf0

Re: [zfs-discuss] zfs data corruption

2008-04-27 Thread Nathan Kroenert - Server ESG
Note: IANATZD (I Am Not A Team-ZFS Dude)

Speaking as a Hardware Guy, knowing that something is happening, has 
happened or is indicated to happen is a Good Thing (tm).

Begin unlikely, but possible scenario:

If, for instance, I'm getting a cluster of read errors (or perhaps bad 
blocks), I could:
  - See it as it's happening
  - See the block number for each error
  - Already know the rate at which the errors are happening
  - Be able to determine that it's not good, and that it's time to replace 
the disk.
  - You get the picture...

And based on this information, I could feel confident that I have the 
right information at hand to be able to determine that it is or is not 
time to replace this disk.

Of course, that assumes:
  - I know anything about disks
  - I know anything about the error messages
  - I have some sort of logging tool that recognises the errors (and 
does not just throw out the 'retryable ones', as most I have seen are 
configured to do)
  - I care
  - The folks watching the logs in the enterprise management tool care
  - My storage even bothers to report the errors

Certainly, for some organisations, all of the above are exactly how it 
works, and it works well for them.

Looking at the ZFS/FMA approach, it certainly is somewhat different.

The (very) rough concept is that FMA gets pretty much all errors 
reported to it. It logs them, in a persistent store, which is always 
available to view. It also makes diagnoses on the errors, based on the 
rules that exist for that particular style of error. Once enough (or the 
right type of) errors happen, it'll then make a Fault Diagnosis for that 
component, and log a message, loud and proud into the syslog. It may 
also take other actions, like, retire a page of memory, offline a CPU, 
panic the box, etc.

So - That's the rough overview.

It's worth noting up front that we can *observe* every event that has 
happened. Using fmdump and fmstat we can immediately see if anything 
interesting has been happening, or we can wait for a Fault Diagnosis, in 
which case, we can just watch /var/adm/messages.
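
For example, a quick once-over might look like this (straight from 
memory, so check the man pages for the exact options you care about):

  # fmstat        (per-module stats: is anything actually seeing events?)
  # fmdump        (the fault log: the diagnoses FMA has made)
  # fmdump -e     (the error log: the individual ereports behind them)
  # fmdump -eV    (the same, in full gory detail)
  # fmadm faulty  (resources currently diagnosed as faulty)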

I also *believe* (though am not certain - Perhaps someone else on the 
list might be?) it would be possible to have each *event* (so - the 
individual events that lead to a Fault Diagnosis) generate a message if 
it was required, though I have never taken the time to do that one...

There are many advantages to this approach - It does not rely on 
logfiles, offsets into logfiles, counters of previously processed 
messages, and all of the other doom and gloom that comes with scraping 
logfiles. It's something you can simply ask: Any issues, chief? The 
answer is there in a flash.

You will also be less likely to have the messages rolled out of the logs 
before you get to them (another classic...).

And - You get some great details from fmdump showing you what's really 
going on, and it's something that's really easy to parse to look for 
patterns.
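
As a rough sketch of the sort of thing I mean (the column layout here is 
from memory, so treat it as illustrative rather than gospel):

  # fmdump -e | awk 'NR > 1 { print $NF }' | sort | uniq -c | sort -rn

which gives you a quick count of ereport classes, most frequent first.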

All of this said, I understand that if you feel things are being 'hidden' 
from you until they're *actually* busted, you're having some of your 
forward vision obscured 'in the name of a quiet logfile'. I felt much 
the same way for a period of time. (Though I live more in the CPU / 
Memory camp...)

But - Once I realised what I could do with fmstat and fmdump, I was not 
the slightest bit unhappy. (Actually, that's not quite true... Even once 
I knew what they could do, it still took me a while to work out the 
options I cared about for fmdump / fmstat.) I now trust FMA to look 
after my CPU / Memory issues better than I could myself. I can still 
get what I need when I want to, and the data is actually more 
accessible and interesting. I just needed to know where to go looking.

All this being said, I was not actually aware that many of our disk / 
target drivers were FMA'd up yet. Heh - Shows what I know.

Does any of this make you feel any better (or worse)?

Nathan.

Mark A. Carlson wrote:
 fmd(1M) can log faults to syslogd that are already diagnosed. Why
 would you want the random spew as well?
 
 -- mark
 
 Carson Gaspar wrote:
 [EMAIL PROTECTED] wrote:

   
 It's not safe to jump to this conclusion.  Disk drivers that support FMA
 won't log error messages to /var/adm/messages.  As more support for I/O
 FMA shows up, you won't see random spew in the messages file any more.
 

 <mode=large financial institution paying support customer>
 That is a Very Bad Idea. Please convey this to whoever thinks that 
 they're helping by not syslogging I/O errors. If this shows up in 
 Solaris 11, we will Not Be Amused. Lack of off-box error logging will 
 directly cause loss of revenue.
 </mode>

   
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] ZFS vs. Novell NSS

2008-02-28 Thread Nathan Kroenert - Server ESG
Hm -

Based on this detail from the page:

Change lever for switching between Rotation
   + Hammering, Neutral and Hammering only

I'd hope it could still hammer... Though I'd suspect the size of nails 
it would hammer would be somewhat limited... ;)

Nathan.

Boyd Adamson wrote:
 Richard Elling [EMAIL PROTECTED] writes:
 Tim wrote:
 The greatest hammer in the world will be inferior to a drill when 
 driving a screw :)

 The greatest hammer in the world is a rotary hammer, and it
 works quite well for driving screws or digging through degenerate
 granite ;-)  Need a better analogy.
 Here's what I use (quite often) on the ranch:
 http://www.hitachi-koki.com/powertools/products/hammer/dh40mr/dh40mr.html
 
 Hasn't the greatest hammer in the world lost the ability to drive
 nails? 
 
 I'll have to start belting them in with the handle of a screwdriver...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, iSCSI + Mac OS X Tiger (globalSAN iSCSI)

2007-07-04 Thread Nathan Kroenert - Server ESG
Hey there -

This is very likely completely unrelated, but here goes anyhoo...

I have noticed with some particular ethernet adapters (e1000g in my 
case) and large MTU sizes (8K) that things (most anything that really 
pushes the interface) sometimes stop for no good reason on my x86 
Solaris boxes. After it stops, I'm able to re-connect after a short time 
and it works for a while again... (Really must get around to properly 
reproducing the problem and logging a bug too...)

I'd be curious to know if setting the MTU to 1500 on both systems makes 
any difference at all.
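
Something along these lines on each end (the interface name is just an 
example from my e1000g case; substitute your own):

  # ifconfig e1000g0 mtu 1500

and then re-test. (Making it stick over a reboot means touching the 
driver's .conf file, the details of which I won't pretend to remember 
off the top of my head...)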

Note that I have only observed this with my super cheap adapters at 
home. I'm yet to see it (though also yet to try really hard) on the more 
expensive ones at work...

Again - Likely nothing to do with your problem, but hey. It has made a 
difference for me before...

Cheers.

Nathan.


George wrote:
 I have set up an iSCSI ZFS target that seems to connect properly from 
 the Microsoft Windows initiator in that I can see the volume in MMC Disk 
 Management.
 
  
 When I shift over to Mac OS X Tiger with globalSAN iSCSI, I am able to 
 set up the Targets with the target name shown by `iscsitadm list target` 
 and when I actually connect or Log On I see that one connection exists 
 on the Solaris server.  I then go on to the Sessions tab in globalSAN 
 and I see the session details and it appears that data is being 
 transferred via the PDUs Sent, PDUs Received, Bytes, etc.  HOWEVER, the 
 connection then appears to terminate on the Solaris side; if I check it a 
 few minutes later it shows no connections, but the Mac OS X initiator 
 still shows connected although no more traffic appears to be flowing in 
 the Session Statistics dialog area.
 
  
 Additionally, when I then disconnect the Mac OS X initiator it seems to 
 drop fine on the Mac OS X side, even though the Solaris side has shown 
 it gone for a while; however, when I reconnect or Log On again, it seems 
 to spin infinitely on the Target Connect... dialog.  Solaris is, 
 interestingly, showing 1 connection while this apparent issue (spinning 
 beachball of death) is going on with globalSAN.  Even killing the Mac OS 
 X process doesn't seem to get me full control again as I have to restart 
 the system to kill all processes (unless I can hunt them down and `kill 
 -9` them which I've not successfully done thus far).
 
 Has anyone dealt with this before, and would perhaps be able to assist or 
 at least throw some further information towards me to troubleshoot this?
 
  
 
  
 Thanks much,
 
  
 -George
 
 
 
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss