Re: [zfs-discuss] ZFS 60 second pause times to read 1K
I've got the box on eval, and just pushing through its paces. Ideally I would be replicating to another x4500, but I don't have another one and didn't want to use 22 disks for another pool. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS 60 second pause times to read 1K
Michael Kucharski wrote: We have a x4500 setup as a single 4*( raid2z 9 + 2)+2 spare pool and have the files system mounted over v5 krb5 NFS and accessed directly. The pool is a 20TB pool and is using . There are three filesystems, backup, test and home. Test has about 20 million files and uses 4TB. These files range from 100B to 200MB. Test has a cron job to take snapshots every 15 minutes from 1m on the hour. Every 15min at 2min on the hour a cron batch job runs to zfs send/recv to the backup filesystem. Home has only 100GB. Are you doing this send|recv within the same pool? What's the motivation for that? Can't you just use zfs clone, which would be much faster and use less disk space? Or if you want another copy (which seems unlikely since you can already tolerate any 2 disks failing in your pool), then use zfs set copies=2 fs. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS 60 second pause times to read 1K
Thanks. Looks like I have this bug. Is it a hardware problem combined with a software problem? Oct 9 09:35:43 zeta1 sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Oct 9 09:35:43 zeta1 port 3: device reset Oct 9 09:35:43 zeta1 scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd18): Oct 9 09:35:43 zeta1 Error for Command: read(10)Error Level: Retryable Oct 9 09:35:43 zeta1 scsi: [ID 107833 kern.notice] Requested Block: 6063576 Error Block: 6063576 Oct 9 09:35:43 zeta1 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: Oct 9 09:35:43 zeta1 scsi: [ID 107833 kern.notice] Sense Key: No Additional Sense Oct 9 09:35:43 zeta1 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0 Oct 9 09:35:43 zeta1 sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Oct 9 09:35:43 zeta1 port 3: device reset Oct 9 09:35:43 zeta1 sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Oct 9 09:35:43 zeta1 port 3: link lost Oct 9 09:35:43 zeta1 sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: Oct 9 09:35:43 zeta1 port 3: link established Oct 9 09:35:43 zeta1 marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx4: error on port 3: Oct 9 09:35:43 zeta1 marvell88sx: [ID 517869 kern.info]device disconnected Oct 9 09:35:43 zeta1 marvell88sx: [ID 517869 kern.info]device connected This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS 60 second pause times to read 1K
I'm not sure. But when I would re-run a scrub, I got the errors at the same block numbers, which indicated that the disk was really bad. It wouldn't hurt to make the entry in the /etc/system file, reboot, and then try the scrub again. If the problem disappears then it is a driver bug. Gary This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS 60 second pause times to read 1K
Every day we see pause times of sometime 60 seconds to read 1K of a file for local reads as well as NFS in a test setup. We have a x4500 setup as a single 4*( raid2z 9 + 2)+2 spare pool and have the files system mounted over v5 krb5 NFS and accessed directly. The pool is a 20TB pool and is using . There are three filesystems, backup, test and home. Test has about 20 million files and uses 4TB. These files range from 100B to 200MB. Test has a cron job to take snapshots every 15 minutes from 1m on the hour. Every 15min at 2min on the hour a cron batch job runs to zfs send/recv to the backup filesystem. Home has only 100GB. The test dir has 3 directories, 1 has 130,000 files other 2 have 10,000,000. We have 4 processes, 2 over NFS, 2 local, and 2 reading the dir with 130,000 files, the other 2 reading the dir with 10,000,000. Every 35 seconds each process reads 1K of 64thK of 10 files and records the latency, then reads for 1 second and record the throughput. At times of no other activity (outside the snapshot and send/recv times) we see read latencies of up to 60 seconds, maybe once a day at random times. We are using an unpatched Solaris 10 08/07 build. Pause times this long can to timeouts and jobs to fails which is problematic for us. Is this expected behaviour? Can anything be done to mitigate or diagnose the issue? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS 60 second pause times to read 1K
Are there any clues in the logs? I have had a similar problem when a disk bad block was uncovered by zfs. I've also seen this when using the Silicon Image driver without the recommended patch. The former became evident when I ran a scrub. I saw the SCSI timeout errors pop up in the kern syslogs. I solved this by replacing the disk. Gary This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS 60 second pause times to read 1K
Excellent. Oct 9 13:36:01 zeta1 scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd13): Oct 9 13:36:01 zeta1 Error for Command: readError Level: Retryable Scrubbing now. Big thanks ggendel. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS 60 second pause times to read 1K
Michael m.kucharski at bigfoot.com writes: Excellent. Oct 9 13:36:01 zeta1 scsi: [ID 107833 kern.warning] WARNING: /pci at 2,0/pci1022,7458 at 8/pci11ab,11ab at 1/disk at 2,0 (sd13): Oct 9 13:36:01 zeta1 Error for Command: readError Level: Retryable Scrubbing now. This is only a part of the complete error message. Look a few lines above this one. If you see something like: sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: port 1: device reset sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: port 1: link lost sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]: port 1: link established marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 1: marvell88sx: [ID 517869 kern.info] device disconnected marvell88sx: [ID 517869 kern.info] device connected Then it means you are probably affected by bug http://bugs.opensolaris.org/view_bug.do?bug_id=6587133 This bug is fixed in Solaris Express build 73 and above, and will likely be fixed in Solaris 10 Update 5. The workaround is to disable SATA NCQ and queuing by adding set sata:sata_func_enable = 0x4 to /etc/system. -marc ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss