Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-04 Thread Simon Breden
I have moved this saga to storage-discuss now, as this doesn't appear to be a 
ZFS issue, and it can be found here:

http://www.opensolaris.org/jive/thread.jspa?threadID=59201
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-03 Thread Simon Breden
Thanks Max, I have done a few tests with what you suggest, and I have listed the 
output below. I wait a few minutes before deciding it has failed; there is 
never any console output about anything failing, and nothing in any of the log 
files I've looked in (/var/adm/messages, /var/log/syslog). Maybe if I left it 2 
hours I might see a message somewhere, but who knows?

This is a nasty problem, as:
1. It appears to fail on different files, although I think I'm seeing a 
common pattern where it often fails on the third file.
2. It copied all the files successfully once (see the log below for run #2), 
but run #3, done immediately afterwards, failed again, so I list debug 
output for run #3.
3. I cannot kill the hanging cp command, and then my whole ZFS filesystem is 
locked up, meaning I have to reboot. Even an 'ls' command often hangs due 
to the hanging 'cp'.
4. I cannot use 'shutdown -y -g 0 -i 5' to shut down the machine, as it seems 
to be blocked by the hanging cp command.
5. The only way to shut down the machine is to hit the reset button, and I 
don't like doing this when there are, theoretically, write operations 
occurring, or at least pending.

Anyway, here is the long output. Perhaps people replying can avoid quoting 
this text in their reply, or we'll be lost in a sea of dump output :)




output: run #1 (after reboot)

bash-3.2$ truss -topen cp -r testdir z
open(/var/ld/ld.config, O_RDONLY) Err#2 ENOENT
open(/lib/libc.so.1, O_RDONLY)= 3
open(/usr/lib/locale/en_GB.UTF-8/en_GB.UTF-8.so.3, O_RDONLY) = 3
open(/usr/lib/locale/common/methods_unicode.so.3, O_RDONLY) = 3
open(/lib/libsec.so.1, O_RDONLY)  = 3
open(/lib/libcmdutils.so.1, O_RDONLY) = 3
open(/lib/libavl.so.1, O_RDONLY)  = 3
open64(testdir/f06, O_RDONLY) = 4
open64(testdir/f15, O_RDONLY) = 4
open64(testdir/f12, O_RDONLY) = 4



# mdb -k
Loading modules: [ unix genunix specfs dtrace cpu.generic 
cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs ip hook neti sctp arp usba 
s1394 nca lofs zfs random md sppp smbsrv nfs ptm crypto ipc ]
> ::pgrep cp
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R    910    909    909    869    501 0x4a004000 ff01d96e6e30 cp
> ff01d96e6e30::walk thread | ::threadlist -v
ADDR PROC  LWP CLS PRIWCHAN
ff01d4371b60 ff01d96e6e30 ff01d9b28930   2  60 ff01f4aa52c0
  PC: _resume_from_idle+0xf1CMD: cp -r testdir z
  stack pointer for thread ff01d4371b60: ff000949f260
  [ ff000949f260 _resume_from_idle+0xf1() ]
swtch+0x17f()
cv_wait+0x61()
zio_wait+0x5f()
dmu_buf_hold_array_by_dnode+0x214()
dmu_read+0xd4()
zfs_fillpage+0x15e()
zfs_getpage+0x187()
fop_getpage+0x9f()
segvn_fault+0x9ef()
as_fault+0x5ae()
pagefault+0x95()
trap+0x1286()
0xfb8001d9()
fuword8+0x21()
zfs_write+0x147()
fop_write+0x69()
write+0x2af()
write32+0x1e()
sys_syscall32+0x101() 
  
 


bash-3.2$ iostat -xce 1
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk033.55.7  431.0   32.6  0.4  0.2   14.5   6  12   0   0   0   0   4 
 2  0 94
sd0   1.34.3  111.1   45.0  0.1  0.0   17.3   0   1   0   0   0   0 
sd1   2.04.4  210.0   45.1  0.1  0.0   14.5   0   1   0   0   0   0 
sd2   1.34.3  111.1   45.2  0.1  0.0   17.4   0   1   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   1 
48  0 51
sd0 325.6   56.3 36340.5  146.7 26.2  0.9   71.0  92  92   0   0   0   0 
sd1 518.6   49.2 65734.5  117.1 25.9  0.9   47.0  86  85   0   0   0   0 
sd2 327.6   57.3 36983.7  144.7 27.3  0.9   73.4  94  93   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   0 
43  0 57
sd0 301.11.0 33550.00.0 23.6  0.8   80.8  84  84   0   0   0   0 
sd1 556.21.0 69661.10.0 26.1  0.8   48.3  92  83   0   0   0   0 
sd2 300.15.0 33229.94.0 23.8  0.8   80.7  84  84   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device 

Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-03 Thread Simon Breden
Well, I had some more ideas and ran some more tests:

1. cp -r testdir ~/z1

This copied the testdir directory from the zfs pool into my home directory on 
the IDE boot drive, so not part of the zfs pool, and this worked.

2. cp -r ~/z1 .

This copied the files back from my home directory on the IDE boot disk and into 
the ZFS pool. This worked.

3. cp -r z1 z2 

This copied the files from the ZFS pool to another directory in the ZFS pool, 
and it has not worked -- it hung again, but differently this time. It copied a 
couple of files, then hung. The mouse wouldn't move and the keyboard was 
inactive; I hit loads of keys, including ALT+TAB, and finally the mouse was 
back. The copy continued for 2 or 3 more files and then hung again; this time 
no more files are being copied, and it hung on a different file than on the 
other runs.


So from these tests, it appears that copying the test directory out of the ZFS 
pool is successful, copying it in from outside the pool is successful, but 
reading and writing the files completely within the pool is failing. My gut 
instinct is that reading and writing purely within the pool is stressing the 
disks due to more I/O demands on the disks used for the ZFS pool, but this may 
be completely wrong.
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-03 Thread Simon Breden
The plot thickens. I replaced 'cp' with 'rsync' and it worked -- I ran it a few 
times and it hasn't hung so far.

So on the face of it, it appears that 'cp' is doing something that causes my 
system to hang if the files are read from and written to the same pool, but 
simply replacing 'cp' with 'rsync' works. Hmmm... does anyone have a clue about 
what I can do next to home in on the problem with 'cp'?

Here is the output using 'rsync' :

bash-3.2$ truss -topen rsync -a z1 z2
open(/var/ld/ld.config, O_RDONLY) Err#2 ENOENT
open(/lib/libsocket.so.1, O_RDONLY)   = 3
open(/lib/libnsl.so.1, O_RDONLY)  = 3
open(/lib/libc.so.1, O_RDONLY)= 3
open(/usr/lib/locale/en_GB.UTF-8/en_GB.UTF-8.so.3, O_RDONLY) = 3
open(/usr/lib/locale/common/methods_unicode.so.3, O_RDONLY) = 3
open64(/etc/popt, O_RDONLY)   Err#2 ENOENT
open64(/export/home/simon/.popt, O_RDONLY)Err#2 ENOENT
open(/usr/lib/iconv/UTF-8%UTF-8.so, O_RDONLY) = 3
open64(/var/run/name_service_door, O_RDONLY)  = 3
open64(z1/testdir/f01, O_RDONLY)  = 5
open64(z1/testdir/f02, O_RDONLY)  = 5
open64(z1/testdir/f03, O_RDONLY)  = 5
open64(z1/testdir/f04, O_RDONLY)  = 5
open64(z1/testdir/f05, O_RDONLY)  = 5
open64(z1/testdir/f06, O_RDONLY)  = 5
open64(z1/testdir/f07, O_RDONLY)  = 5
open64(z1/testdir/f08, O_RDONLY)  = 5
open64(z1/testdir/f09, O_RDONLY)  = 5
open64(z1/testdir/f10, O_RDONLY)  = 5
open64(z1/testdir/f11, O_RDONLY)  = 5
open64(z1/testdir/f12, O_RDONLY)  = 5
open64(z1/testdir/f13, O_RDONLY)  = 5
open64(z1/testdir/f14, O_RDONLY)  = 5
open64(z1/testdir/f15, O_RDONLY)  = 5
open64(z1/testdir/f16, O_RDONLY)  = 5
Received signal #18, SIGCLD, in pollsys() [caught]
  siginfo: SIGCLD CLD_EXITED pid=910 status=0x
bash-3.2$
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-03 Thread [EMAIL PROTECTED]
Hi Simon,
The rsync and cp commands work very differently. cp mmaps up to 8 MB of the 
input file and writes from the address returned by mmap, faulting in the pages 
as it writes (unless you are a normal user on Indiana, in which case cp is 
GNU cp, which uses plain read/write -- so, why are there two versions?). 
Rsync forks, sets up a socketpair between the parent and child processes, and 
then uses read/write. It should be much slower than cp, and put much less 
stress on the disk.
It would be great to have a way to reproduce this; I have not had any 
problems. How large is the directory you are copying? Either the disk has not 
sent a response to an I/O operation, or the response was somehow lost. If I 
could reproduce the problem, I might try to dtrace the commands being sent to 
the HBA and the responses coming back... Hopefully someone here who has 
experience with the disks you are using will be able to help.

max



Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-03 Thread Rob
Oops, I lied... according to my self:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-January/045141.html

commands in 'wait' are queued in Solaris, and 'actv' > 1 are in
the drive's NCQ.

So the question is: where are the drive's commands getting
dropped across 3 disks at the same time?

And in all cases it's not a zfs issue, but a disk, controller
or [EMAIL PROTECTED] issue.

Rob


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-03 Thread Simon Breden
Thanks Max, and the fact that rsync stresses the system less would help explain 
why rsync works, and cp hangs. The directory was around 11GB in size.

If Sun engineers are interested in this problem then I'm happy to run whatever 
commands they give me -- after all, I have a pure goldmine here for them to 
debug ;-) And it *is* running on a ZFS filesystem. Opportunities like this 
don't come along every day :) Tempted? :)

Well, if I can't tempt Sun, then for anyone who has the same disks, I would be 
interested to see what happens on your machine:
Model Number: WD7500AAKS-00RBA0
Firmware revision: 4G30

I use three of these disks in a RAIDZ1 vdev within the pool.
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-03 Thread [EMAIL PROTECTED]
Hi Simon,
I think Rob Logan is probably correct, and there is a problem with the 
disks, not zfs.  Have you
tried this with a different file system (ufs), or multiple dd commands 
running at the same time with
the raw disks?

max



Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-03 Thread Dave
I have similar, but not exactly the same drives:

format> inq
Vendor:   ATA
Product:  WDC WD7500AYYS-0
Revision: 4G30

Same firmware revision. I have no problems with drive performance, 
although I use them under UFS and for backing stores for iscsi disks.

FYI, I had random lockups and crashes on my Tyan MB with the MCP55 
chipset. I bought Supermicro AOC-SAT2-MV8's and moved all my disks to 
them. Haven't had a problem since.

http://de.opensolaris.org/jive/thread.jspa?messageID=204736

--
Dave

On 05/03/2008 01:44 PM, Simon Breden wrote:
 @Max: I've not tried this with other file systems, and not with multiple dd 
 commands at the same time with raw disks. I suppose this is not possible to 
 do with my disks which are currently part of this RAIDZ1 vdev in the pool 
 without corrupting data? I'll assume not.
 
 @Rob: OK, let's assume that, like you say, it's not a ZFS issue, but in fact 
 a drive, firmware etc issue. That said, where should I create a new thread -- 
 in storage-discuss ? I will refer to these 2 threads here for all the gory 
 details ;-)
 
 If this can be proven to be a disk problem then I want to return them under 
 warranty and get some different ones. Normally these disks have absolutely 
 excellent user feedback on newegg.com, so I'm quite surprised if the disks 
 are ALL bad.
 
 I wonder if, in fact, there could be some issue between the motherboard's 
 BIOS and the drives, or is this not possible? The motherboard is an Asus 
 M2N-SLI Deluxe and it uses an NVidia controller (MCP 55?) which is part of 
 the NVidia 570 SLI chipset, again not exactly an exotic, unused chipset.
 
 If it's possible for the BIOS to affect the disk interface then I have seen 
 that a new BIOS is available, which I could try.
 
 Also I could update to snv_87, which is the latest build, although I first saw 
 this problem with that build (87), so the OS upgrade might not help.
  
  


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-03 Thread Simon Breden
Wow, thanks Dave. Looks like you've had this hell too :)

So, that makes me happy that the disks and pool are probably OK, but it does 
seem to be an issue with the NVidia MCP 55 chipset, or at least perhaps the 
nv_sata driver. From reading the bug list below, it seems the problem might be 
a more general disk driver problem, perhaps not just limited to nv_sata.

I looked at the post you listed, and followed a chain of bug reports:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6658565 (Accepted)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6642154 (dup of 6662881)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6662881 (fixed in snv_87)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6669134 (fixed in snv_90)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6662400 (dup of 6669134)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6671523 (Accepted)
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6672202 (Need more info)

Maybe I'll try snv_87 again, or just wait until snv_90 is released.

My workaround until this issue is fixed is to use 'rsync' for intra-pool 
copies, as it seems to stress the system less and thus avoids the ZFS file 
system lockup.

Thanks to everyone who helped.

I might post a link to this thread in the storage-discuss group to see if I can 
get any further help, if anyone there knows more details on this.

That Supermicro AOC MV8 card looks good, but I would prefer not to have to buy 
new hardware to fix what should hopefully turn out to be a driver problem.
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-02 Thread Simon Breden
OK, I tried replying by email, and got a message that a moderator will approve 
the message sometime... but that was a few hours ago, so I'm reverting to this 
forum software again :)

Here's the reply I emailed:

Hi Richard,

I ran the format command, selected the number of one of the disks in the RAIDZ 
vdev, and then selected the 'inquiry' subcommand; here's the output:

format> inquiry
Vendor:   ATA
Product:  WDC WD7500AAKS-0
Revision: 4G30
format>

So it looks like the drive firmware is '4G30'

Now I will hunt around to see if there are any known problems with that 
firmware version. However, I already looked yesterday on the Western Digital 
site for these drives, model # WD7500AAKS, and I didn't see any firmware 
available to download...

I emailed Western Digital now to enquire about drive firmware for this drive 
model, so I'll see what they say...

Simon
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-02 Thread Simon Breden
Thanks Max,

I have not been able to find any new firmware for these drives (Western Digital 
WD7500AAKS) so I have sent an email to Western Digital to enquire about 
firmware updates. I'll see what they reply with, but I'm not too hopeful.

In the meantime I decided to copy the files one at a time until the problem 
re-occurred. I have found the file that failed, but of course it probably 
won't be the same file each time it fails, so this is probably a pointless 
exercise -- still, I was curious. I will embark on iteration 2 now...

Simon
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-02 Thread max
Hi Simon,

One quick note.  You don't have to cp each file one at a time to see  
which one it hangs on.  Just run truss.  It should be the last file  
that it opened.  To see this with truss, do:

truss cp -r ...

Don't worry about all the truss output.  You are probably only  
concerned with the end.  If the file is large and does not fail for  
many mmap/write calls, you can just truss opens and closes:

truss -topen,close cp -r ...

max




Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-02 Thread Rob Logan
or work around the NCQ bug in the drive's FW by typing:

su
echo "set sata:sata_max_queue_depth = 0x1" >> /etc/system
reboot

Rob


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-02 Thread Cindy . Swearingen
Simon,

I think you should review the checksum error reports from the fmdump
output (dated 4/30) that you supplied previously.

You can get more details by using fmdump -ev.

Use zpool status -v to identify checksum errors as well.

Cindy




Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-02 Thread Simon Breden
Thanks Cindy.

Here's the zpool status -v output:

# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
tankONLINE   0 0 0
  raidz1ONLINE   0 0 0
c1t1d0  ONLINE   0 0 0
c1t0d0  ONLINE   0 0 0
c2t0d0  ONLINE   0 0 0

errors: No known data errors
# 

And here's the fmdump output:

# fmdump -ev
TIME CLASS ENA
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum   0x00bcf295d8f1
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum   0x00bcfb9991f00401
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum   0x00bcfb9991f00401
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum   0x00bcfb9991f00401
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum   0x00bcfb9991f00401
Apr 30 01:12:14.9896 ereport.fs.zfs.data   0x00bcfb9991f00401
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum   0x00bcf295d8f1
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum   0x00bcf295d8f1
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum   0x00bcf295d8f1
Apr 30 01:12:14.9802 ereport.fs.zfs.data   0x00bcf295d8f1
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum   0x00d3cb44fe900401
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum   0x00d3cb44fe900401
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum   0x00d3cb44fe900401
Apr 30 01:12:21.1130 ereport.fs.zfs.checksum   0x00d3cb44fe900401
Apr 30 01:12:21.1130 ereport.fs.zfs.data   0x00d3cb44fe900401
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum   0x00d5510ce6700401
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum   0x00d5510ce6700401
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum   0x00d5510ce6700401
Apr 30 01:12:21.5217 ereport.fs.zfs.checksum   0x00d5510ce6700401
Apr 30 01:12:21.5217 ereport.fs.zfs.data   0x00d5510ce6700401
# 

Simon
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-02 Thread Cindy . Swearingen
Okay, thanks.

I wanted to rule out that the checksum errors reported on 4/30
were persistent enough to be picked up by zpool status. ZFS is
generally quick to identify device problems.

Since fmdump doesn't show any additional recent errors either, I think you can 
rule out hardware problems other than the slowness problem.

You can generate more detailed fmdump output, but I don't think that is 
necessary unless I'm missing some detail in the story so far...

Cindy



Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-02 Thread Simon Breden
OK then, thanks Cindy.

I have 2 current lines of investigation left (at least):

1. the assumption that the problem relates to a drive firmware bug
2. a new BIOS for the motherboard is available, which might possibly have 
some effect

For the idea that it's a drive firmware bug, I'm currently trying to see how to 
disable NCQ...

Simon
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-02 Thread Simon Breden
Assuming that this problem could be related to a drive firmware bug I have:

1. turned off NCQ -- or in fact limited the queue depth to 1
2. used truss with the cp command

I found this for NCQ: http://blogs.sun.com/erickustarz/entry/ncq_tunable

=

NCQ sata_max_queue_depth tunable

Previously, i did some analysis on NCQ in OpenSolaris. It turned out that to 
get good multi-stream sequential read performance, you had to disable NCQ via 
the tunable 'sata_func_enable'. Disabling NCQ actually does two things: 1) sets 
the number of concurrent I/Os to 1 and 2) changes what you send down protocol 
wise.

Turns out, the first is all we really need to get good performance for the 
multi-stream sequential read case, and doing the second actually exposes a bug 
on certain firmware of certain disks. So i highly recommend the newly added 
'sata_max_queue_depth' tunable instead of 'sata_func_enable'. As a reminder, 
put the following in /etc/system and reboot:

set sata:sata_max_queue_depth = 0x1

=

Anyway, after adding the line above into /etc/system, I rebooted and then 
re-tried the copy with truss:

truss cp -r testdir z4

It seems to hang on random files -- so it's not always the same file that it 
hangs on.

On this particular run here are the last few lines of truss output, although 
they're probably not useful:

write(5,  r8A pF2D9B69DCF96ADD2 O.., 8388608) = 8388608
mmap64(0xFE40, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 4, 0x2300) = 
0xFE40
write(5, C0A6 IF5 IB9A6 7 0F794 }.., 8388608) = 8388608
mmap64(0xFE40, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 4, 0x2380) = 
0xFE40
write(5, 8014D819B407 f ?F8F3E5FF.., 8388608) = 8388608
mmap64(0xFE40, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 4, 0x2400) = 
0xFE40
write(5,  q01DFD0 + % y7FB8 ^ AD.., 8388608) = 8388608
mmap64(0xFE40, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 4, 0x2480) = 
0xFE40

Any ideas what I can do next?
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread Simon Breden
Sorry for the delay. Here is the output for a couple of seconds:

# iostat -xce 1
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 1.50.7   20.84.2  0.0  0.09.0   0   1   0   0   0   0   2 
 1  0 98
sd0   0.30.2   29.4   25.1  0.1  0.0  174.5   0   0   0   0   0   0 
sd1   0.30.2   33.2   25.0  0.1  0.0  166.8   0   0   0   0   0   0 
sd2   0.20.2   26.8   24.8  0.1  0.0  180.3   0   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   0 
 0  0 100
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   1 
 1  0 98
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   2 
 1  0 98
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0  26 
 3  0 70
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   2 
 0  0 98
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.01.00.02.0  0.0  0.00.1   0   0   0   0   0   0   1 
 1  0 98
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0  18 
 1  0 81
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   6 
 1  0 94
sd0   0.00.00.00.0 35.0  0.00.0 100   

Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread Richard Elling
Simon Breden wrote:
 Sorry for the delay. Here is the output for a couple of seconds:
   

This is the smoking gun...

 # iostat -xce 1
  extended device statistics  errors ---   
cpu
 devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  
 us sy wt id
 cmdk0 1.50.7   20.84.2  0.0  0.09.0   0   1   0   0   0   0   
 2  1  0 98
 sd0   0.30.2   29.4   25.1  0.1  0.0  174.5   0   0   0   0   0   0 
 sd1   0.30.2   33.2   25.0  0.1  0.0  166.8   0   0   0   0   0   0 
 sd2   0.20.2   26.8   24.8  0.1  0.0  180.3   0   0   0   0   0   0 
 sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
   

I/O is moving, but not much activity...

  extended device statistics  errors ---   
cpu
 devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  
 us sy wt id
 cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   
 0  0  0 100
 sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
 sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
 sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
 sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
   

I/O is stuck at the device.  %w[ait] of 100 means that something is
stuck on the device.  This is probably not a ZFS issue, per se, but
somehow the ZFS workload triggers it.
 -- richard
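For long iostat captures like the ones pasted in this thread, a small filter makes the stuck condition easier to spot than eyeballing every sample. A minimal sketch (assuming the `iostat -xce` column order quoted above, with the wait queue depth as field 6 and `%w` as field 9; the sample file and its path are made up for illustration):

```shell
# Flag devices whose wait-queue percentage (%w, 9th field) is pegged at 100.
# The sample lines mimic the iostat -xce layout quoted in this thread.
cat > /tmp/iostat.sample <<'EOF'
sd0   0.0  0.0  0.0  0.0  35.0  0.0  0.0  100  0  0  0  0  0
sd3   0.0  0.0  0.0  0.0   0.0  0.0  0.0    0  0  0  5  0  5
EOF
awk '$9 == 100 { print $1 ": stuck with " $6 " queued I/Os" }' /tmp/iostat.sample
# -> sd0: stuck with 35.0 queued I/Os
```

In a live capture one would pipe `iostat -xce 1` through the same awk filter instead of reading from a file.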

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread Rob Logan

hmm, three drives with 35 io requests in the queue
and none active? remind me not to buy a drive
with that FW..

1) upgrade the FW in the drives or

2) turn off NCQ with:
echo "set sata:sata_max_queue_depth = 0x1" >> /etc/system


Rob
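A hedged sketch of applying Rob's option 2, shown here against a scratch copy rather than the live /etc/system (editing the real file requires root, and the tunable only takes effect after a reboot):

```shell
# Sketch: append the NCQ-disabling tunable from Rob's post to a scratch
# copy of /etc/system; on the real system, append to /etc/system as root
# and reboot for it to take effect.
scratch=$(mktemp)
echo 'set sata:sata_max_queue_depth = 0x1' >> "$scratch"
grep sata_max_queue_depth "$scratch"
# -> set sata:sata_max_queue_depth = 0x1
rm -f "$scratch"
```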


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread Simon Breden
Thanks a lot Richard. To give a bit more info, I've copied my /var/adm/messages 
from booting up the machine:

And @picker: I guess the 35 requests are stacked up waiting for the hanging 
request to be serviced?

The question I have is where do I go from now, to get some more info on what is 
causing cp to have problems.

I will now try another tack: use rsync to copy the directory to a disk outside 
the pool (i.e. my home directory on the boot drive), to see if it is happy 
doing that.


May  1 17:04:15 zfsbox su: [ID 810491 auth.crit] 'su root' failed for simon on 
/dev/pts/5
May  1 17:48:15 zfsbox genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 
Version snv_85 64-bit
May  1 17:48:15 zfsbox genunix: [ID 172908 kern.notice] Copyright 1983-2008 Sun 
Microsystems, Inc.  All rights reserved.
May  1 17:48:15 zfsbox Use is subject to license terms.
May  1 17:48:15 zfsbox unix: [ID 126719 kern.info] features: 
13f6fff<cpuid,tscp,cmp,cx16,sse3,nx,asysc,sse2,sse,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
May  1 17:48:15 zfsbox unix: [ID 168242 kern.info] mem = 4192764K (0xffe7f000)
May  1 17:48:15 zfsbox rootnex: [ID 466748 kern.info] root nexus = i86pc
May  1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] pseudo0 at root
May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] pseudo0 is /pseudo
May  1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] scsi_vhci0 at root
May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] scsi_vhci0 is /scsi_vhci
May  1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] isa0 at root
May  1 17:48:15 zfsbox pci_autoconfig: [ID 139057 kern.info] NOTICE: reprogram 
io-range on ppb[0/f/0]: 0x2000 ~ 0x2fff
May  1 17:48:15 zfsbox pci_autoconfig: [ID 595143 kern.info] NOTICE: add 
io-range on subtractive ppb[0/6/0]: 0x3000 ~ 0x3fff
May  1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 
0 1 1 1
May  1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 
1 1 1 1
May  1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 
2 1 1 1
May  1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local nmi: 
3 1 1 1
May  1 17:48:15 zfsbox pcplusmp: [ID 177709 kern.info] pcplusmp: vector 0x9 
ioapic 0x4 intin 0x9 is bound to cpu 1
May  1 17:48:15 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: acpippm0
May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] acpippm0 is 
/pseudo/[EMAIL PROTECTED]
May  1 17:48:15 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: ppm0
May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] ppm0 is /pseudo/[EMAIL 
PROTECTED]
May  1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] npe0 at root: space 0 
offset 0
May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] npe0 is /[EMAIL 
PROTECTED],0
May  1 17:48:15 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: ide (ata) 
instance 0 vector 0xe ioapic 0x4 intin 0xe is bound to cpu 0
May  1 17:48:15 zfsbox genunix: [ID 640982 kern.info]   IDE device at targ 0, 
lun 0 lastlun 0x0
May  1 17:48:15 zfsbox genunix: [ID 846691 kern.info]   model HDS722516VLAT80
May  1 17:48:15 zfsbox genunix: [ID 479077 kern.info]   ATA/ATAPI-6 supported, 
majver 0x7c minver 0x19
May  1 17:48:15 zfsbox genunix: [ID 640982 kern.info]   ATAPI device at targ 1, 
lun 0 lastlun 0x0
May  1 17:48:15 zfsbox genunix: [ID 846691 kern.info]   model TSSTcorpDVD-ROM 
SH-D162C
May  1 17:48:15 zfsbox npe: [ID 236367 kern.info] PCI Express-device: [EMAIL 
PROTECTED], ata0
May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] ata0 is /[EMAIL 
PROTECTED],0/[EMAIL PROTECTED]/[EMAIL PROTECTED]
May  1 17:48:15 zfsbox genunix: [ID 773945 kern.info]   UltraDMA mode 2 selected
May  1 17:48:15 zfsbox genunix: [ID 773945 kern.info]   UltraDMA mode 5 selected
May  1 17:48:15 zfsbox gda: [ID 243001 kern.info] Disk0:Vendor 
'Gen-ATA ' Product 'HDS722516VLAT80 '
May  1 17:48:17 zfsbox ata: [ID 496167 kern.info] cmdk0 at ata0 target 0 lun 0
May  1 17:48:17 zfsbox genunix: [ID 936769 kern.info] cmdk0 is /[EMAIL 
PROTECTED],0/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
May  1 17:48:20 zfsbox unix: [ID 190185 kern.info] SMBIOS v2.4 loaded (2169 
bytes)
May  1 17:48:20 zfsbox genunix: [ID 408114 kern.info] /cpus (cpunex0) online
May  1 17:48:20 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: dld0
May  1 17:48:20 zfsbox genunix: [ID 936769 kern.info] dld0 is /pseudo/[EMAIL 
PROTECTED]
May  1 17:48:20 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: 
pciexclass,060400 (pcie_pci) instance 1 vector 0x18 ioapic 0xff intin 0xff is 
bound to cpu 1
May  1 17:48:20 zfsbox npe: [ID 236367 kern.info] PCI Express-device: 
pci10de,[EMAIL PROTECTED], pcie_pci1
May  1 17:48:20 zfsbox genunix: [ID 936769 kern.info] pcie_pci1 is /[EMAIL 
PROTECTED],0/pci10de,[EMAIL PROTECTED]
May  1 17:48:21 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: pci10de,163 
(nvidia) instance 0 vector 0x10 ioapic 0x4 intin 0x10 is bound to cpu 0
May  1 17:48:21 zfsbox pcie_pci: [ID 586369 kern.info] PCIE-device: 

Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread [EMAIL PROTECTED]
Hi Simon,

Simon Breden wrote:
 Thanks a lot Richard. To give a bit more info, I've copied my 
 /var/adm/messages from booting up the machine:

 And @picker: I guess the 35 requests are stacked up waiting for the hanging 
 request to be serviced?

 The question I have is where do I go from now, to get some more info on what 
 is causing cp to have problems.

 I will now try another tack: use rsync to copy the directory to a disk 
 outside the pool (i.e. my home directory on the boot drive), to see if it is 
 happy doing that.
   
What does truss show the cp doing? 
max




Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread Simon Breden
This list seems out of sync (delayed) with email messages I receive.

Why is that?

Which are the best tools to use when reading / replying to these posts? 

Anyway from my email I can see that Max has sent me a question about truss -- 
here is my reply:

Hi Max,

I haven't used truss before, but give me the command line + switches and I'll 
be happy to run it.

Simon
 
 
This message posted from opensolaris.org


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread [EMAIL PROTECTED]
Hi Simon,


Simon Breden wrote:
 Hi Max,

 I haven't used truss before, but give me the command line + switches 
 and I'll be happy to run it.

 Simon
# truss -p <pid_from_cp>

where pid_from_cp is... the pid of the cp process that is hung.  The 
pid you can get from ps.

I am curious if the cp is stuck on a specific file, or is just very 
slow, or is hung in the kernel.
Also, can you kill the cp when it hangs?

thanks,
max
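The PID lookup max describes can be scripted. A minimal sketch, using a background `sleep` as a stand-in for the hung cp (truss itself is Solaris-specific, so the attach step is shown commented out):

```shell
# Stand-in for the hung cp: a background sleep we can look up by PID.
sleep 5 &
pid=$!
# Confirm the process is alive; this is the PID you would hand to truss.
ps -p "$pid" -o pid= | tr -d ' '
# On the Solaris box: truss -p "$pid"    # trace its system calls live
kill "$pid"
```

If truss attaches but prints nothing, the process is blocked in the kernel rather than looping in userland, which is exactly what Simon observes later in the thread.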

 2008/5/1 [EMAIL PROTECTED]:

 Hi Simon,

 Simon Breden wrote:

 Thanks a lot Richard. To give a bit more info, I've copied my
 /var/adm/messages from booting up the machine:

 And @picker: I guess the 35 requests are stacked up waiting
 for the hanging request to be serviced?

 The question I have is where do I go from now, to get some
 more info on what is causing cp to have problems.

 I will now try another tack: use rsync to copy the directory
 to a disk outside the pool (i.e. my home directory on the boot
 drive), to see if it is happy doing that.
  

 What does truss show the cp doing? max






Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread Simon Breden
This mailing list seems broken and out of sync -- your post shows as 'Guest' and 
appears as a new post in the main zfs-discuss list -- the main thread is 
out of sync with the replies, and I just got a Java exception trying to post to 
the main thread -- what's going on here?
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread Simon Breden
Hi Max,

I re-ran the cp command and when it hanged I ran 'ps -el', looked up the cp 
command, got its PID and then ran:

# truss -p PID_of_cp

and it output nothing at all -- i.e. it hanged too -- just showing a flashing 
cursor.

The system is still operational as I am typing into the browser.

Before I ran the cp command I did a 'tail -f /var/adm/messages' and there is no 
output. I also did a 'tail -f /var/log/syslog' and there is also no output.

If I try 'kill -15 PID_of_cp' and then 'ps -el' cp is still running.
And if I try 'kill -9 PID_of_cp' and then 'ps -el' cp is still running.

What next ?

For what it's worth, here is further output from iostat, the first line looks 
interesting, but it appears to be buffered, as it's the same each time I run it 
-- possibly the last info from cp before it died???  (see below)

So, any ideas what I can do next? Just going to reboot to kill the hanging cp 
-- wow, this is like Windows now :(  (attempt to keep humor)

bash-3.2$ iostat -xce 1
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk021.42.7  191.4   32.4  0.4  0.1   20.8   3   5   0   0   0   0   4 
 2  0 94
sd0   9.9   12.6 1193.8 1478.2 11.0  0.0  489.0  32   4   0   0   0   0 
sd1   7.5   12.6  836.2 1479.1 10.9  0.0  543.7  32   4   0   0   0   0 
sd2   7.7   12.6  877.1 1479.4 10.9  0.0  537.1  32   4   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   1 
 1  0 98
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   2 
 0  0 98
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   1 
 1  0 98
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   2 
 0  0 98
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
 extended device statistics  errors --- 
 cpu
devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot  us 
sy wt id
cmdk0 0.00.00.00.0  0.0  0.00.0   0   0   0   0   0   0   4 
 1  0 96
sd0   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd1   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd2   0.00.00.00.0 35.0  0.00.0 100   0   0   0   0   0 
sd3   0.00.00.00.0  0.0  0.00.0   0   0   0   5   0   5 
^C
bash-3.2$ 



and I just saw this:  (seems to be from yesterday -- I don't know if it's 
relevant to this though ??)



# fmdump -e
TIME CLASS
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum 
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum 
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum 
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum 
Apr 30 01:12:14.9896 ereport.fs.zfs.checksum 
Apr 30 01:12:14.9896 ereport.fs.zfs.data 
Apr 30 01:12:14.9802 ereport.fs.zfs.checksum 
Apr 30 

Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread [EMAIL PROTECTED]
Hi Simon,

Simon Breden wrote:
 Hi Max,

 I re-ran the cp command and when it hanged I ran 'ps -el', looked up the cp 
 command, got its PID and then ran:

 # truss -p PID_of_cp

 and it output nothing at all -- i.e. it hanged too -- just showing a flashing 
 cursor.

 The system is still operational as I am typing into the browser.

 Before I ran the cp command I did a 'tail -f /var/adm/messages' and there is 
 no output. I also did a 'tail -f /var/log/syslog' and there is also no output.

 If I try 'kill -15 PID_of_cp' and then 'ps -el' cp is still running.
 And if I try 'kill -9 PID_of_cp' and then 'ps -el' cp is still running.

 What next ?
   
You can try the following:

# mdb -k
::pgrep cp   -- this should give you a line with the cp you are 
running.  Next to cp is an address, use this address in the next line:

<address_from_pgrep>::walk thread | ::threadlist -v

This will give you a stack trace.  Please post it.

$q  -- this gets you out of mdb

max



Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread Simon Breden
Keep getting Java exceptions posting to the proper thread for this -- just lost 
an hour --- WTF???

Had to reply to my own post as Max's reply (which I saw in my email inbox) has 
not appeared here. Again, what is wrong with this forum software -- it seems so 
buggy, or am I missing something here (I'm quite prepared to be wrong, but I'm 
getting really p*ssed off with this forum engine as I'm wasting time and 
getting Java errors about missing parent which I will post if I get more errors 
trying to post with this broken forum software -- if it allows me to)


Thanks for your advice Max, and here is my reply to your suggestion:


# mdb -k
Loading modules: [ unix genunix specfs dtrace cpu.generic 
cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs ip hook neti sctp arp usba 
s1394 nca lofs zfs random md sppp smbsrv nfs ptm ipc crypto ]
 ::pgrep cp
SPID   PPID   PGIDSIDUID  FLAGS ADDR NAME
R    889    868    889    868    501 0x4a004000 ff01deca9048 cp
 ff01deca9048::walk thread | ::threadlist -v
ADDR PROC  LWP CLS PRIWCHAN
ff01e0045840 ff01deca9048 ff01de9d9210   2  60 ff01d861ca80
  PC: _resume_from_idle+0xf1    CMD: cp -pr testdir z1
  stack pointer for thread ff01e0045840: ff0007fcdf00
  [ ff0007fcdf00 _resume_from_idle+0xf1() ]
swtch+0x17f()
cv_wait+0x61()
zio_wait+0x5f()
dbuf_read+0x1b5()
dbuf_findbp+0xe8()
dbuf_prefetch+0x9b()
dmu_zfetch_fetch+0x43()
dmu_zfetch_dofetch+0xc2()
dmu_zfetch_find+0x3a1()
dmu_zfetch+0xa5()
dbuf_read+0xe3()
dmu_buf_hold_array_by_dnode+0x1c4()
dmu_read+0xd4()
zfs_fillpage+0x15e()
zfs_getpage+0x187()
fop_getpage+0x9f()
segvn_fault+0x9ef()
as_fault+0x5ae()
pagefault+0x95()
trap+0x1286() 
0xfb8001d9()  
fuword8+0x21()
zfs_write+0x147() 
fop_write+0x69()  
write+0x2af() 
write32+0x1e()
sys_syscall32+0x101() 
  

 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread Simon Breden
Just to reduce my stress levels and to give the webmaster some useful info to 
help fix this broken forum:

I tried posting a reply to the main thread for 'cp -r hanged copying a 
directory' and got the following error -- seems like it can't find the parent 
thread/message's id in the database at a guess:

Got this multiple times when trying to post, and as no information is given on 
how to contact the system administrator in the message -- here is the message 
and exception stack trace:


 An error in the system has occurred. Please contact the system administrator 
if the problem persists.

type: java.lang.IllegalArgumentException

java.lang.IllegalArgumentException: Parent key 230968 not found when adding 
child 231029.
at com.jivesoftware.util.LongTree.addChild(Unknown Source)
at com.jivesoftware.forum.database.DbTreeWalker.addChild(Unknown Source)
at com.jivesoftware.forum.database.DbForumThread.addMessage(Unknown 
Source)
at com.jivesoftware.forum.proxy.ForumThreadProxy.addMessage(Unknown 
Source)
at org.opensolaris.jive.action.ForumPostAction.createMessage(Unknown 
Source)
at com.jivesoftware.forum.action.PostAction.execute(Unknown Source)
at com.jivesoftware.forum.action.PostAction.doPost(Unknown Source)
at sun.reflect.GeneratedMethodAccessor881.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at 
com.opensymphony.xwork.DefaultActionInvocation.invokeAction(DefaultActionInvocation.java:300)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:166)
at 
com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at 
com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at 
com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at 
com.jivesoftware.forum.action.JiveExceptionInterceptor.intercept(Unknown Source)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at 
com.jivesoftware.base.action.JiveObjectLoaderInterceptor.intercept(Unknown 
Source)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at 
com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at 
com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at 
com.opensymphony.webwork.interceptor.FileUploadInterceptor.intercept(FileUploadInterceptor.java:71)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at 
com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at 
com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at 
com.opensymphony.xwork.interceptor.AroundInterceptor.intercept(AroundInterceptor.java:35)
at 
com.opensymphony.xwork.DefaultActionInvocation.invoke(DefaultActionInvocation.java:164)
at 
com.opensymphony.xwork.DefaultActionProxy.execute(DefaultActionProxy.java:116)
at 
com.opensymphony.webwork.dispatcher.ServletDispatcher.serviceAction(ServletDispatcher.java:272)
at com.jivesoftware.base.util.JiveWebWorkServlet.service(Unknown Source)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at sun.reflect.GeneratedMethodAccessor156.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at 
org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
at 
org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:275)
at 
org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)
at 

Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread [EMAIL PROTECTED]
Hi Simon,
Simon Breden wrote:

 Thanks for your advice Max, and here is my reply to your suggestion:


 # mdb -k
 Loading modules: [ unix genunix specfs dtrace cpu.generic 
 cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs ip hook neti sctp arp usba 
 s1394 nca lofs zfs random md sppp smbsrv nfs ptm ipc crypto ]
   
 ::pgrep cp
 
 SPID   PPID   PGIDSIDUID  FLAGS ADDR NAME
 R    889    868    889    868    501 0x4a004000 ff01deca9048 cp
   
 ff01deca9048::walk thread | ::threadlist -v
 
 ADDR PROC  LWP CLS PRIWCHAN
 ff01e0045840 ff01deca9048 ff01de9d9210   2  60 ff01d861ca80
   PC: _resume_from_idle+0xf1    CMD: cp -pr testdir z1
   stack pointer for thread ff01e0045840: ff0007fcdf00
   [ ff0007fcdf00 _resume_from_idle+0xf1() ]
 swtch+0x17f()
 cv_wait+0x61()
 zio_wait+0x5f()
 dbuf_read+0x1b5()
 dbuf_findbp+0xe8()
 dbuf_prefetch+0x9b()
 dmu_zfetch_fetch+0x43()
 dmu_zfetch_dofetch+0xc2()
 dmu_zfetch_find+0x3a1()
 dmu_zfetch+0xa5()
 dbuf_read+0xe3()
 dmu_buf_hold_array_by_dnode+0x1c4()
 dmu_read+0xd4()
 zfs_fillpage+0x15e()
 zfs_getpage+0x187()
 fop_getpage+0x9f()
 segvn_fault+0x9ef()
 as_fault+0x5ae()
 pagefault+0x95()
 trap+0x1286() 
 0xfb8001d9()  
 fuword8+0x21()
 zfs_write+0x147() 
 fop_write+0x69()  
 write+0x2af() 
 write32+0x1e()
 sys_syscall32+0x101() 
   
   
So, a write has been issued, zfs is retrieving a page and is waiting for 
the page-in to complete.  I'll take a further look tomorrow,
but maybe someone else reading this has an idea.  (It is midnight here).

max

  
  

   



Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread Richard Elling
Simon Breden wrote:
 Thanks a lot Richard. To give a bit more info, I've copied my 
 /var/adm/messages from
 booting up the machine:
   

 And @picker: I guess the 35 requests are stacked up waiting for the hanging 
 request to be serviced?

 The question I have is where do I go from now, to get some more info on what 
 is causing cp to have problems.

 I will now try another tack: use rsync to copy the directory to a disk 
 outside the pool (i.e. my home directory on the boot drive), to see if it is 
 happy doing that.


 May  1 17:04:15 zfsbox su: [ID 810491 auth.crit] 'su root' failed for simon 
 on /dev/pts/5
 May  1 17:48:15 zfsbox genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 
 Version snv_85 64-bit
 May  1 17:48:15 zfsbox genunix: [ID 172908 kern.notice] Copyright 1983-2008 
 Sun Microsystems, Inc.  All rights reserved.
 May  1 17:48:15 zfsbox Use is subject to license terms.
 May  1 17:48:15 zfsbox unix: [ID 126719 kern.info] features: 
 13f6fff<cpuid,tscp,cmp,cx16,sse3,nx,asysc,sse2,sse,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
 May  1 17:48:15 zfsbox unix: [ID 168242 kern.info] mem = 4192764K (0xffe7f000)
 May  1 17:48:15 zfsbox rootnex: [ID 466748 kern.info] root nexus = i86pc
 May  1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] pseudo0 at root
 May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] pseudo0 is /pseudo
 May  1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] scsi_vhci0 at root
 May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] scsi_vhci0 is /scsi_vhci
 May  1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] isa0 at root
 May  1 17:48:15 zfsbox pci_autoconfig: [ID 139057 kern.info] NOTICE: 
 reprogram io-range on ppb[0/f/0]: 0x2000 ~ 0x2fff
 May  1 17:48:15 zfsbox pci_autoconfig: [ID 595143 kern.info] NOTICE: add 
 io-range on subtractive ppb[0/6/0]: 0x3000 ~ 0x3fff
 May  1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local 
 nmi: 0 1 1 1
 May  1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local 
 nmi: 1 1 1 1
 May  1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local 
 nmi: 2 1 1 1
 May  1 17:48:15 zfsbox pcplusmp: [ID 658230 kern.info] NOTICE: apic: local 
 nmi: 3 1 1 1
 May  1 17:48:15 zfsbox pcplusmp: [ID 177709 kern.info] pcplusmp: vector 0x9 
 ioapic 0x4 intin 0x9 is bound to cpu 1
 May  1 17:48:15 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: acpippm0
 May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] acpippm0 is 
 /pseudo/[EMAIL PROTECTED]
 May  1 17:48:15 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: ppm0
 May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] ppm0 is /pseudo/[EMAIL 
 PROTECTED]
 May  1 17:48:15 zfsbox rootnex: [ID 349649 kern.info] npe0 at root: space 0 
 offset 0
 May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] npe0 is /[EMAIL 
 PROTECTED],0
 May  1 17:48:15 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: ide (ata) 
 instance 0 vector 0xe ioapic 0x4 intin 0xe is bound to cpu 0
 May  1 17:48:15 zfsbox genunix: [ID 640982 kern.info] IDE device at 
 targ 0, lun 0 lastlun 0x0
 May  1 17:48:15 zfsbox genunix: [ID 846691 kern.info] model 
 HDS722516VLAT80
 May  1 17:48:15 zfsbox genunix: [ID 479077 kern.info] ATA/ATAPI-6 
 supported, majver 0x7c minver 0x19
 May  1 17:48:15 zfsbox genunix: [ID 640982 kern.info] ATAPI device at 
 targ 1, lun 0 lastlun 0x0
 May  1 17:48:15 zfsbox genunix: [ID 846691 kern.info] model 
 TSSTcorpDVD-ROM SH-D162C
 May  1 17:48:15 zfsbox npe: [ID 236367 kern.info] PCI Express-device: [EMAIL 
 PROTECTED], ata0
 May  1 17:48:15 zfsbox genunix: [ID 936769 kern.info] ata0 is /[EMAIL 
 PROTECTED],0/[EMAIL PROTECTED]/[EMAIL PROTECTED]
 May  1 17:48:15 zfsbox genunix: [ID 773945 kern.info] UltraDMA mode 2 
 selected
 May  1 17:48:15 zfsbox genunix: [ID 773945 kern.info] UltraDMA mode 5 
 selected
 May  1 17:48:15 zfsbox gda: [ID 243001 kern.info] Disk0:  Vendor 
 'Gen-ATA ' Product 'HDS722516VLAT80 '
 May  1 17:48:17 zfsbox ata: [ID 496167 kern.info] cmdk0 at ata0 target 0 lun 0
 May  1 17:48:17 zfsbox genunix: [ID 936769 kern.info] cmdk0 is /[EMAIL 
 PROTECTED],0/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
 May  1 17:48:20 zfsbox unix: [ID 190185 kern.info] SMBIOS v2.4 loaded (2169 
 bytes)
 May  1 17:48:20 zfsbox genunix: [ID 408114 kern.info] /cpus (cpunex0) online
 May  1 17:48:20 zfsbox pseudo: [ID 129642 kern.info] pseudo-device: dld0
 May  1 17:48:20 zfsbox genunix: [ID 936769 kern.info] dld0 is /pseudo/[EMAIL 
 PROTECTED]
 May  1 17:48:20 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: 
 pciexclass,060400 (pcie_pci) instance 1 vector 0x18 ioapic 0xff intin 0xff is 
 bound to cpu 1
 May  1 17:48:20 zfsbox npe: [ID 236367 kern.info] PCI Express-device: 
 pci10de,[EMAIL PROTECTED], pcie_pci1
 May  1 17:48:20 zfsbox genunix: [ID 936769 kern.info] pcie_pci1 is /[EMAIL 
 PROTECTED],0/pci10de,[EMAIL PROTECTED]
 May  1 17:48:21 zfsbox pcplusmp: [ID 803547 kern.info] pcplusmp: 

Re: [zfs-discuss] cp -r hanged copying a directory

2008-05-01 Thread Richard Elling
[forget the BUI forum, e-mail works better, IMHO]

Simon Breden wrote:
 Thanks a lot Richard. To give a bit more info, I've copied my 
 /var/adm/messages from booting up the machine:
   

I don't see any major issues related to this problem in the messages.

 And @picker: I guess the 35 requests are stacked up waiting for the hanging 
 request to be serviced?
   

The 35 iops are queued in the device driver waiting to be sent to the
device (wait).  The iops queued to the device is in the active queue
(actv).

 The question I have is where do I go from now, to get some more info on what 
 is causing cp to have problems.
   

As your later mdb test confirmed, ZFS is patiently waiting on an I/O
to complete.  Solaris has sent that I/O to a device and hasn't heard a
reply yet.  By default, the sd driver will retry every 60 seconds, and
eventually fail the I/O which will bounce back up to ZFS as an I/O
error.  Please double check your drive firmware.
 -- richard
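The 60-second retry behaviour Richard describes is governed in the sd driver by the `sd_io_time` tunable (the tunable name and its 60-second default are stated from memory of the Solaris sd driver -- verify them against your release before using). If one wanted hung I/O to fail back to ZFS faster for debugging, the /etc/system entry would look roughly like this, sketched against a scratch file rather than the live one:

```shell
# Sketch (assumed tunable): shorten sd's per-I/O timeout from the default
# 60s to ~20s via /etc/system, demonstrated on a scratch copy.
scratch=$(mktemp)
echo 'set sd:sd_io_time = 0x14' >> "$scratch"   # 0x14 = 20 seconds
grep sd_io_time "$scratch"
# -> set sd:sd_io_time = 0x14
rm -f "$scratch"
```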




Re: [zfs-discuss] cp -r hanged copying a directory

2008-04-28 Thread Simon Breden
I don't like the sound of broken hardware :(

I did the cp -r dir1 dir2 again and when it hanged I issued 'fmdump -e' like 
you said -- here is the output:

# fmdump -e
TIME CLASS
fmdump: /var/fm/fmd/errlog is empty
# 

I also checked /var/adm/messages and I didn't see anything in there either. So 
I don't know what else I can do to understand what's happening. Any ideas?
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-04-28 Thread Rob Logan
  I did the cp -r dir1 dir2 again and when it hanged

when it's hung, can you run  iostat -xce 1
in another window? Is there a 100 in the %b column?
When you reset and try the cp again and look at
iostat -xce 1 on the second hang, is the same disk at 100 in %b?

If all your windows are hung, does your keyboard's Num Lock LED
track the Num Lock key?

Rob
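
[Editor's note: a small filter can make the busy disk jump out of the iostat stream. A sketch, with fabricated sample output; in practice pipe real `iostat -x 1` output through the same awk. The %b column index is read from the header line rather than hard-coded, since options like -c/-e add or move columns.]

```shell
# Print any device sitting at 100 in the %b column of iostat -x output.
busy=$(awk '
  /%b/ { for (i = 1; i <= NF; i++) if ($i == "%b") col = i; next }
  col && $col == 100 { print $1 }
' <<'EOF'
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd0    0.0 0.0 0.0  0.0  0.0  0.0  0.0     0   0
sd2    0.0 1.0 0.0  8.0  35.0 1.0  60000.0 100 100
EOF
)
echo "100% busy: $busy"
```

A disk pinned at 100 %b with a huge svc_t and a full wait queue, across reboots, is the classic signature of one device swallowing an I/O.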



[zfs-discuss] cp -r hanged copying a directory

2008-04-27 Thread Simon Breden
I installed b87 today and then I made a copy of a directory.

To my surprise, a few seconds later the drive access light went out. Upon 
inspection, only a couple of the files had been copied, and the cp command 
appeared to have hung.

I did: cp -r dir1 dir2

ps -el showed that cp was still running but not taking much, if any, processor 
time. Nothing changed after a minute or so, so I tried to kill the cp process 
with -15, then a -9. Even that failed.

Using the file manager (nautilus?), all I saw was a spinning logo and no files 
when I tried to access the directory I had copied the data to.

Thinking this was a bug in b87, I rolled back to b85. Back in b85, I tried the 
same thing and, again, it failed in exactly the same way. So it was nothing to 
do with b87; in fact, it was something funny to do with my files. I did a scrub 
and saw no errors.
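
[Editor's note: the scrub result is worth re-checking after each hang, since checksum errors only appear once a scrub completes. A sketch of checking the result; the status line below is fabricated, so substitute the real line from `zpool status -v <pool>` on the affected box.]

```shell
# Fabricated example of the scrub line from `zpool status -v <pool>`.
status='scrub: scrub completed after 0h12m with 0 errors on Sun Apr 27 14:00:00 2008'

case "$status" in
  *' 0 errors'*) result="clean"  ;;
  *)             result="errors" ;;
esac
echo "scrub result: $result"
```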

What I now need to do is diagnose the problem. What do people here suggest I do 
next to find the root cause of this problem? I suspect there's something screwy 
with the files in the source directory -- some funny attribute somewhere (I 
remember some bug in b82 with file dates around 1970 or something). Luckily 
this directory didn't contain many files and I have a copy anyway, and the data 
is not critical.

-
Blog: http://breden.org.uk
-
 
 


Re: [zfs-discuss] cp -r hanged copying a directory

2008-04-27 Thread Richard Elling
Simon Breden wrote:
 I installed b87 today and then I made a copy of a directory.

 To my surprise, a few seconds later the drive access light went out. Upon 
 inspection, only a couple of the files had been copied, and the cp command 
 appeared to have hung.

 I did: cp -r dir1 dir2

 ps -el showed that cp was still running but not taking much, if any, 
 processor time. Nothing changed after a minute or so, so I tried to kill the 
 cp process with -15, then a -9. Even that failed.
   

This is usually a symptom of broken hardware.  The reason kill -9
didn't work is that the process is waiting for the I/O to complete,
and it hasn't completed yet.  When the I/O completes or times out,
the process will resume, receive the kill signal, and die.  The timeout
may be on the order of 3-5 minutes for simple disks, though it may
be longer for some configurations.  Take a look at fmdump -e
to see if there are errors being logged.

[note to self: file RFE for the system log viewer to grab fmdump output]
 -- richard
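
[Editor's note: to see what such a process is actually blocked on, the kernel stacks of its threads can be dumped with mdb, which is the "mdb test" referred to elsewhere in this thread. A sketch follows; the PID is hypothetical, and while ::pid2proc, ::walk, and ::findstack are standard mdb dcmds, double-check them on your build.]

```shell
pid=1234  # hypothetical PID of the hung cp, e.g. from `pgrep cp`

# Build the mdb dcmd pipeline that walks the process's threads and
# prints each kernel stack; the 0t prefix marks the PID as decimal.
dcmd="0t${pid}::pid2proc | ::walk thread | ::findstack -v"
echo "$dcmd"

# On the affected machine, run it against the live kernel (as root):
#   echo "$dcmd" | mdb -k
```

A stack parked in biowait() or a driver's wait routine confirms the thread is stuck in the I/O path rather than in ZFS itself.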

 Using the file manager (nautilus?), all I saw was a spinning logo and no 
 files when I tried to access the directory where I copied data to.

 Thinking this was a bug in b87, I rolled back to b85. Back in b85, I tried 
 the same thing, and again, it failed in exactly the same way. So it was 
 nothing to do with b87, but in fact, something funny to do with my files. I 
 did a scrub and saw no errors.

 What I now need to do is diagnose the problem. What do people here suggest I 
 do next to find the root cause of this problem? I suspect there's something 
 screwy with the files in the source directory -- some funny attribute 
 somewhere (I remember some bug in b82 with file dates around 1970 or 
 something). Luckily this directory didn't contain many files and I have a 
 copy anyway, and the data is not critical.

