I have moved this saga to storage-discuss now, as this doesn't appear to be a
ZFS issue, and it can be found here:
http://www.opensolaris.org/jive/thread.jspa?threadID=59201
This message posted from opensolaris.org
___
zfs-discuss mailing list
Thanks Max, I have done a few tests with what you suggest and I have listed the
output below. I wait a few minutes before deciding it's failed, and there is
never any console output about anything failing, and nothing in any log files
I've looked in: /var/adm/messages or /var/log/syslog. Maybe
Well, I had some more ideas and ran some more tests:
1. cp -r testdir ~/z1
This copied the testdir directory from the zfs pool into my home directory on
the IDE boot drive, so not part of the zfs pool, and this worked.
2. cp -r ~/z1 .
This copied the files back from my home directory on the
The plot thickens. I replaced 'cp' with 'rsync' and it worked -- I ran it a few
times and it didn't hang so far.
So on the face of it, it appears that 'cp' is doing something that causes my
system to hang if the files are read from and written to the same pool, but
simply replacing 'cp' with
Hi Simon,
Simon Breden wrote:
The plot thickens. I replaced 'cp' with 'rsync' and it worked -- I ran it a
few times and it didn't hang so far.
So on the face of it, it appears that 'cp' is doing something that causes my
system to hang if the files are read from and written to the same pool,
oops, I lied... according to my self
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-January/045141.html
wait are queued in solaris and active 1 are in
the drives NCQ.
so the question is: Where are the drive's command getting
dropped across 3 disks at the same time?
and in all cases
Thanks Max, and the fact that rsync stresses the system less would help explain
why rsync works, and cp hangs. The directory was around 11GB in size.
If Sun engineers are interested in this problem then I'm happy to run whatever
commands they give me -- after all, I have a pure goldmine here
Hi Simon,
Simon Breden wrote:
Thanks Max, and the fact that rsync stresses the system less would help
explain why rsync works, and cp hangs. The directory was around 11GB in size.
If Sun engineers are interested in this problem then I'm happy to run
whatever commands they give me -- after
I have similar, but not exactly the same drives:
format inq
Vendor: ATA
Product: WDC WD7500AYYS-0
Revision: 4G30
Same firmware revision. I have no problems with drive performance,
although I use them under UFS and for backing stores for iscsi disks.
FYI, I had random lockups and crashes on
Wow, thanks Dave. Looks like you've had this hell too :)
So, that makes me happy that the disks and pool are probably OK, but it does
seem an issue with the NVidia MCP 55 chipset, or at least perhaps the nv_sata
driver. From reading the bug list below, it seems the problem might be a more
OK, I tried replying by email, and got a message that a moderator will approve
the message sometime... but that was a few hours ago, so I'm reverting to this
forum software again :)
Here's the reply I emailed:
Hi Richard,
I ran the format command, selected the number of one of the disks in the
Thanks Max,
I have not been able to find any new firmware for these drives (Western Digital
WD7500AAKS) so I have sent an email to Western Digital to enquire about
firmware updates. I'll see what they reply with, but I'm not too hopeful.
In the meantime I decided to copy the files one at a
Hi Simon,
One quick note. You don't have to cp each file one at a time to see
which one it hangs on. Just run truss. It should be the last file
that it opened. To see this with truss, do:
truss cp -r ...
Don't worry about all the truss output. You are probably only
concerned with the
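A minimal sketch of the truss idea above, assuming the trace was saved to a file first (for example with `truss -o /tmp/cp.truss cp -r dir1 dir2`) and that open calls appear in the log as `open("path", ...)` or `open64("path", ...)`. The helper name `last_open` and the log path are hypothetical, chosen only for illustration.

```shell
# last_open: print the path named in the last open()/open64() call
# found in a saved truss log. Hypothetical helper, not part of Solaris.
last_open() {
    # Keep only lines that look like open("...") or open64("..."),
    # extract the quoted path, and report the final one.
    sed -n 's/^[^"]*open[0-9]*("\([^"]*\)".*/\1/p' "$1" | tail -1
}

# Usage, once cp has hung and the log has been written:
#   last_open /tmp/cp.truss
```

If the log format on your build differs, the sed pattern will need adjusting.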
or work around the NCQ bug in the drive's FW by typing:
su
echo "set sata:sata_max_queue_depth = 0x1" >> /etc/system
reboot
Rob
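For anyone applying the workaround above more than once, here is a small sketch that appends the line only if it is not already in the file. `add_tunable` is a hypothetical helper name, and the sketch assumes a POSIX grep that accepts -q, -x and -F (on Solaris that may mean /usr/xpg4/bin/grep).

```shell
# add_tunable FILE LINE
# Append LINE to FILE unless an identical line is already present.
# Hypothetical helper for illustration, not part of Solaris.
add_tunable() {
    file=$1
    line=$2
    # -x: whole-line match, -F: fixed string, -q: no output
    if grep -qxF -e "$line" "$file" 2>/dev/null; then
        return 0
    fi
    printf '%s\n' "$line" >> "$file"
}

# Usage (as root, followed by a reboot):
#   add_tunable /etc/system "set sata:sata_max_queue_depth = 0x1"
```

Running it twice leaves a single copy of the setting, so /etc/system does not fill up with duplicates.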
Simon,
I think you should review the checksum error reports from the fmdump
output (dated 4/30) that you supplied previously.
You can get more details by using fmdump -ev.
Use zpool status -v to identify checksum errors as well.
Cindy
Simon Breden wrote:
Thanks Max,
I have not been able
Thanks Cindy.
Here's the zpool status -v output:
# zpool status -v tank
pool: tank
state: ONLINE
scrub: none requested
config:
NAME      STATE   READ WRITE CKSUM
tank      ONLINE     0     0     0
raidz1    ONLINE     0     0     0
c1t1d0
Okay, thanks.
I wanted to rule out that the checksum errors reported on 4/30
were persistent enough to be picked up by zpool status. ZFS is
generally quick to identify device problems.
Since fmdump doesn't show any add'l recent errors either, then I
think you can rule out hardware problems other
OK then, thanks Cindy.
I have 2 current lines of investigation left (at least):
1. assumption the problem could relate to a drive firmware bug
2. there's a new BIOS for the motherboard available which might possibly have
some effect
For the idea that it's a drive firmware bug, I'm currently
Assuming that this problem could be related to a drive firmware bug I have:
1. turned off NCQ -- or in fact limited the queue depth to 1
2. used truss with the cp command
I found this for NCQ: http://blogs.sun.com/erickustarz/entry/ncq_tunable
=
NCQ
Sorry for the delay. Here is the output for a couple of seconds:
# iostat -xce 1
extended device statistics errors --- cpu
device r/s w/s kr/s kw/s wait actv svc_t %w %b s/w h/w trn tot us sy wt id
cmdk0 1.5 0.7 20.8 4.2 0.0
Simon Breden wrote:
Sorry for the delay. Here is the output for a couple of seconds:
This is the smoking gun...
# iostat -xce 1
extended device statistics errors --- cpu
device r/s w/s kr/s kw/s wait actv svc_t %w %b s/w h/w trn
hmm, three drives with 35 io requests in the queue
and none active? remind me not to buy a drive
with that FW..
1) upgrade the FW in the drives or
2) turn off NCQ with:
echo "set sata:sata_max_queue_depth = 0x1" >> /etc/system
Rob
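A rough sketch for spotting the pattern described above (requests waiting, none active) in `iostat -xce 1` output. It assumes the column order shown in the header in this thread, so wait is field 6 and actv is field 7. `stalled_devices` is a made-up name.

```shell
# stalled_devices: read iostat -x style output on stdin and print
# "device wait actv" for devices that have queued requests but none
# active. Hypothetical helper; assumes wait is field 6, actv field 7.
stalled_devices() {
    awk 'NF >= 10 &&
         $6 ~ /^[0-9.]+$/ && $7 ~ /^[0-9.]+$/ &&
         $6 > 0 && $7 == 0 { print $1, $6, $7 }'
}

# Usage while the cp is hung:
#   iostat -xce 1 | stalled_devices
```

Header lines are skipped because their wait and actv fields are not numeric.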
Thanks a lot Richard. To give a bit more info, I've copied my /var/adm/messages
from booting up the machine:
And @picker: I guess the 35 requests are stacked up waiting for the hanging
request to be serviced?
The question I have is where do I go from now, to get some more info on what is
Hi Simon,
Simon Breden wrote:
Thanks a lot Richard. To give a bit more info, I've copied my
/var/adm/messages from booting up the machine:
And @picker: I guess the 35 requests are stacked up waiting for the hanging
request to be serviced?
The question I have is where do I go from now, to
This list seems out of sync (delayed) with email messages I receive.
Why is that?
Which are the best tools to use when reading / replying to these posts?
Anyway from my email I can see that Max has sent me a question about truss --
here is my reply:
Hi Max,
I haven't used truss before, but
Hi Simon,
Simon Breden wrote:
Hi Max,
I haven't used truss before, but give me the command line + switches
and I'll be happy to run it.
Simon
# truss -p pid_from_cp
where pid_from_cp is... the pid of the cp process that is hung. The
pid you can get from ps.
I am curious if the cp is
This mailing list seems broken and out of sync -- your post is as 'Guest' and
appears as a new post in the main zfs-discuss list -- and the main thread is
out of sync with the replies, and I just got a java exception trying to post to
the main thread -- what's going on here?
This message
Hi Max,
I re-ran the cp command and when it hung I ran 'ps -el', looked up the cp
command, got its PID and then ran:
# truss -p PID_of_cp
and it output nothing at all -- i.e. it hung too -- just showing a flashing
cursor.
The system is still operational as I am typing into the browser.
Hi Simon,
Simon Breden wrote:
Hi Max,
I re-ran the cp command and when it hung I ran 'ps -el', looked up the cp
command, got its PID and then ran:
# truss -p PID_of_cp
and it output nothing at all -- i.e. it hung too -- just showing a flashing
cursor.
The system is still
Keep getting Java exceptions posting to the proper thread for this -- just lost
an hour --- WTF???
Had to reply to my own post as Max's reply (which I saw in my email inbox) has
not appeared here. Again, what is wrong with this forum software -- it seems so
buggy, or am I missing something
Just to reduce my stress levels and to give the webmaster some useful info to
help fix this broken forum:
I tried posting a reply to the main thread for 'cp -r hanged copying a
directory' and got the following error -- seems like it can't find the parent
thread/message's id in the database at
Hi Simon,
Simon Breden wrote:
Thanks for your advice Max, and here is my reply to your suggestion:
# mdb -k
Loading modules: [ unix genunix specfs dtrace cpu.generic
cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs ip hook neti sctp arp usba
s1394 nca lofs zfs random md sppp smbsrv nfs
Simon Breden wrote:
Thanks a lot Richard. To give a bit more info, I've copied my
/var/adm/messages from
booting up the machine:
And @picker: I guess the 35 requests are stacked up waiting for the hanging
request to be serviced?
The question I have is where do I go from now, to get
[forget the BUI forum, e-mail works better, IMHO]
Simon Breden wrote:
Thanks a lot Richard. To give a bit more info, I've copied my
/var/adm/messages from booting up the machine:
I don't see any major issues related to this problem in the messages.
And @picker: I guess the 35 requests
I don't like the sound of broken hardware :(
I did the cp -r dir1 dir2 again and when it hung I issued 'fmdump -e' like
you said -- here is the output:
# fmdump -e
TIME CLASS
fmdump: /var/fm/fmd/errlog is empty
#
I also checked /var/adm/messages and I didn't see anything in
I did the cp -r dir1 dir2 again and when it hung
when it's hung, can you type: iostat -xce 1
in another window and is there a 100 in the %b column?
when you reset and try the cp again, and look at
iostat -xce 1 on the second hang, is the same disk at 100 in %b?
if all your windows are hung,
Simon Breden wrote:
I installed b87 today and then I made a copy of a directory.
To my surprise, a few seconds later the drive access light went out. Upon
inspection, only a couple of the files had been copied, and the cp command
appeared to have hung.
I did: cp -r dir1 dir2
ps -el