[zfs-discuss] horrible slow pool
Hello everybody,

I just wanted to share my experience with a (partially) broken SSD that was in use in a ZIL mirror. We experienced a dramatic performance problem with one of our zpools, which serves home directories. Mainly NFS clients were affected; our SunRay infrastructure came to a complete halt. We were finally able to identify one SSD as the root cause. The SSD was still working, but very slowly. The problem didn't trigger ZFS to mark the disk as faulty, and FMA didn't detect it either. We identified the broken disk by issuing 'iostat -en'. After replacing the SSD, everything went back to normal.

To prevent outages like this in the future I hacked together a quick and dirty bash script to detect disks with a given number of total errors. The script can be used in conjunction with Nagios. Perhaps it's of use for others as well:

###
#!/bin/bash
# Check the disks in all pools for errors.
# Partially failing (or slow) disks may result in horribly
# degraded performance of a zpool despite the fact that the
# pool is still reported as healthy.
#
# exit codes
# 0 OK
# 1 WARNING
# 2 CRITICAL
# 3 UNKNOWN

OUTPUT=""
WARNING=0
CRITICAL=0
SOFTLIMIT=5
HARDLIMIT=20

# all disk devices that appear in any pool
LIST=$(zpool status | grep 'c[1-9].*d0' | awk '{print $1}')

for DISK in $LIST
do
    # field 4 of 'iostat -enr' is the total error count
    ERROR=$(iostat -enr $DISK | cut -d , -f 4 | grep '^[0-9]')
    if [[ $ERROR -gt $SOFTLIMIT ]]
    then
        OUTPUT="$OUTPUT, $DISK:$ERROR"
        WARNING=1
    fi
    if [[ $ERROR -gt $HARDLIMIT ]]
    then
        CRITICAL=1
    fi
done

if [[ $CRITICAL -gt 0 ]]
then
    echo "CRITICAL: Disks with error count above $HARDLIMIT found: $OUTPUT"
    exit 2
fi
if [[ $WARNING -gt 0 ]]
then
    echo "WARNING: Disks with error count above $SOFTLIMIT found: $OUTPUT"
    exit 1
fi
echo "OK: No significant disk errors found"
exit 0
###

cu
Carsten
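To hook the script into Nagios, an NRPE command definition on the monitored host is probably the simplest route. The path and command name below are made up for illustration; adjust them to wherever the script actually lives:

# /etc/nagios/nrpe.cfg on the file server (hypothetical location and name)
command[check_zfs_disk_errors]=/usr/local/libexec/check_zfs_disk_errors.sh

The Nagios server then calls check_nrpe with that command name; the WARNING/CRITICAL exit codes of the script map directly onto Nagios states.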
[zfs-discuss] Sol11 time-slider / snapshot not starting [again]
Hello everybody,

my time-slider service on a Sol11 machine died. I already deinstalled/installed the time-slider package, restarted the manifest-import service etc., but no success.

/var/svc/log/application-time-slider:default.log:

--snip--
[ Sep 11 12:40:04 Enabled. ]
[ Sep 11 12:40:04 Executing start method (/lib/svc/method/time-slider start). ]
Traceback (most recent call last):
  File "/usr/lib/time-sliderd", line 10, in <module>
    main(abspath(__file__))
  File "/usr/lib/../share/time-slider/lib/time_slider/timesliderd.py", line 941, in main
    snapshot = SnapshotManager(systemBus)
  File "/usr/lib/../share/time-slider/lib/time_slider/timesliderd.py", line 83, in __init__
    self.refresh()
  File "/usr/lib/../share/time-slider/lib/time_slider/timesliderd.py", line 188, in refresh
    self._rebuild_schedules()
  File "/usr/lib/../share/time-slider/lib/time_slider/timesliderd.py", line 285, in _rebuild_schedules
    "Details:\n" + str(message)
RuntimeError: Error reading SMF schedule instances
Details:
['/usr/bin/svcs', '-H', '-o', 'state', 'svc:/system/filesystem/zfs/auto-snapshot:monthly'] failed with exit code 1
svcs: Pattern 'svc:/system/filesystem/zfs/auto-snapshot:monthly' doesn't match any instances
Time Slider failed to start: error 95
[ Sep 11 12:40:06 Method "start" exited with status 95. ]
--snip--

Any suggestions?

thx

Carsten
Re: [zfs-discuss] Sol11 time-slider / snapshot not starting [SOLVED]
-Original message-
To: zfs-discuss@opensolaris.org
From: Carsten John cj...@mpi-bremen.de
Sent: Tue 11-09-2012 13:08
Subject: [zfs-discuss] Sol11 time-slider / snapshot not starting [again]

Hello everybody,
my time-slider service on a Sol11 machine died. I already deinstalled/installed the time-slider package, restarted the manifest-import service etc., but no success.
/var/svc/log/application-time-slider:default.log: ...

Finally I was able to fix it:

- uninstall time-slider
- restart the manifest-import service
- install time-slider
- restart the manifest-import service
- enable the time-slider service
- enable the snapshot services

I have no clue why it has to be done in exactly this order, but finally I succeeded.

cu
Carsten
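For reference, the sequence above corresponds roughly to the following commands on Solaris 11. The package and service names are the stock ones; treat this as a sketch of the procedure rather than a verified recipe:

pkg uninstall time-slider
svcadm restart svc:/system/manifest-import:default
pkg install time-slider
svcadm restart svc:/system/manifest-import:default
svcadm enable svc:/application/time-slider:default
svcadm enable svc:/system/filesystem/zfs/auto-snapshot:frequent \
    svc:/system/filesystem/zfs/auto-snapshot:hourly \
    svc:/system/filesystem/zfs/auto-snapshot:daily \
    svc:/system/filesystem/zfs/auto-snapshot:weekly \
    svc:/system/filesystem/zfs/auto-snapshot:monthly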
Re: [zfs-discuss] Sol11 missing snapshot facility [solved]
-Original message-
To: Carsten John cj...@mpi-bremen.de
CC: zfs-discuss@opensolaris.org
From: Ian Collins i...@ianshome.com
Sent: Thu 05-07-2012 21:40
Subject: Re: [zfs-discuss] Sol11 missing snapshot facility

On 07/ 5/12 11:32 PM, Carsten John wrote:

-Original message-
To: Carsten John cj...@mpi-bremen.de
CC: zfs-discuss@opensolaris.org
From: Ian Collins i...@ianshome.com
Sent: Thu 05-07-2012 11:35
Subject: Re: [zfs-discuss] Sol11 missing snapshot facility

On 07/ 5/12 09:25 PM, Carsten John wrote:

Hi Ian,
yes, I already checked that:

svcs -a | grep zfs
disabled 11:50:39 svc:/application/time-slider/plugin:zfs-send

is the only service I get listed.

Odd. How did you install?

Is the manifest there (/lib/svc/manifest/system/filesystem/auto-snapshot.xml)?

Hi Ian,
I installed from CD/DVD, but it might have been in a rush, as I needed to replace a broken machine as quick as possible.

The manifest is there:

ls /lib/svc/manifest/system/filesystem/
.  ..  auto-snapshot.xml  autofs.xml  local-fs.xml  minimal-fs.xml  rmvolmgr.xml  root-fs.xml  ufs-quota.xml  usr-fs.xml

Running svcadm restart manifest-import should load it, or give you some idea why it won't load.

--
Ian.

Hi Ian,

it did the trick, but I had to uninstall/install the time-slider package.

thx for the help

Carsten
[zfs-discuss] Sol11 missing snapshot facility
Hello everybody,

for some reason I can not find the zfs auto-snapshot service facility any more. I already reinstalled time-slider, but it refuses to start:

RuntimeError: Error reading SMF schedule instances
Details:
['/usr/bin/svcs', '-H', '-o', 'state', 'svc:/system/filesystem/zfs/auto-snapshot:monthly'] failed with exit code 1
svcs: Pattern 'svc:/system/filesystem/zfs/auto-snapshot:monthly' doesn't match any instances

Does anybody know a way to get the services back again?

thx

Carsten
Re: [zfs-discuss] Sol11 missing snapshot facility
-Original message-
To: Carsten John cj...@mpi-bremen.de
CC: zfs-discuss@opensolaris.org
From: Ian Collins i...@ianshome.com
Sent: Thu 05-07-2012 09:59
Subject: Re: [zfs-discuss] Sol11 missing snapshot facility

On 07/ 5/12 06:52 PM, Carsten John wrote:

Hello everybody,
for some reason I can not find the zfs auto-snapshot service facility any more. I already reinstalled time-slider, but it refuses to start:

RuntimeError: Error reading SMF schedule instances
Details:
['/usr/bin/svcs', '-H', '-o', 'state', 'svc:/system/filesystem/zfs/auto-snapshot:monthly'] failed with exit code 1
svcs: Pattern 'svc:/system/filesystem/zfs/auto-snapshot:monthly' doesn't match any instances

Have you looked with svcs -a?

# svcs -a | grep zfs
disabled  Jul_02  svc:/system/filesystem/zfs/auto-snapshot:daily
disabled  Jul_02  svc:/system/filesystem/zfs/auto-snapshot:frequent
disabled  Jul_02  svc:/system/filesystem/zfs/auto-snapshot:hourly
disabled  Jul_02  svc:/system/filesystem/zfs/auto-snapshot:monthly
disabled  Jul_02  svc:/system/filesystem/zfs/auto-snapshot:weekly
disabled  Jul_02  svc:/application/time-slider/plugin:zfs-send

--
Ian.

Hi Ian,

yes, I already checked that:

svcs -a | grep zfs
disabled 11:50:39 svc:/application/time-slider/plugin:zfs-send

is the only service I get listed.

thx

Carsten
Re: [zfs-discuss] Sol11 missing snapshot facility
-Original message-
To: Carsten John cj...@mpi-bremen.de
CC: zfs-discuss@opensolaris.org
From: Ian Collins i...@ianshome.com
Sent: Thu 05-07-2012 11:35
Subject: Re: [zfs-discuss] Sol11 missing snapshot facility

On 07/ 5/12 09:25 PM, Carsten John wrote:

Hi Ian,
yes, I already checked that:

svcs -a | grep zfs
disabled 11:50:39 svc:/application/time-slider/plugin:zfs-send

is the only service I get listed.

Odd. How did you install?

Is the manifest there (/lib/svc/manifest/system/filesystem/auto-snapshot.xml)?

--
Ian.

Hi Ian,

I installed from CD/DVD, but it might have been in a rush, as I needed to replace a broken machine as quickly as possible.

The manifest is there:

ls /lib/svc/manifest/system/filesystem/
.  ..  auto-snapshot.xml  autofs.xml  local-fs.xml  minimal-fs.xml  rmvolmgr.xml  root-fs.xml  ufs-quota.xml  usr-fs.xml

thx

Carsten
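As a follow-up note: if restarting manifest-import does not bring the instances back, importing the manifest by hand and then checking for the instances is the obvious next step. A rough sketch, using only the standard paths mentioned above:

svcadm restart svc:/system/manifest-import:default
# or import the auto-snapshot manifest directly
svccfg import /lib/svc/manifest/system/filesystem/auto-snapshot.xml
svcs -a | grep auto-snapshot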
[zfs-discuss] snapshots slow on sol11?
Hello everybody,

I recently migrated a file server (NFS and Samba) from OpenSolaris (build 111) to Sol11. Since the move we are facing random (or random-looking) outages of our Samba. As we have moved several folders (like Desktop and ApplicationData) out of the usual profile to a folder inside the user's home share, the setup is sensitive to timeouts. From time to time users are getting the infamous Windows "Delayed Write Failure".

After checking nearly every parameter that came to my mind in the last days, the zfs-auto-snapshot mechanism inside Solaris 11 came to my attention. We had hourly and daily snapshots enabled and discovered that the snapshots are not rotated as expected. As there were known issues (if I remember correctly) with timesliderd in OpenIndiana, and we had the old zfs-auto-snap mechanism (without timesliderd) running without any problems before the update, I'm wondering if there are any known (performance) issues with the stuff in Solaris 11.

thx

Carsten
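A quick way to see whether the rotation is actually keeping up is to count the snapshots per schedule and look at their creation times. This assumes the usual zfs-auto-snap_<schedule> naming; adjust the pattern if your snapshots are named differently:

# number of hourly snapshots (should stay near the configured keep count)
zfs list -H -t snapshot -o name | grep 'zfs-auto-snap_hourly' | wc -l
# creation times of the most recent hourly snapshots, to spot a stalled rotation
zfs list -H -t snapshot -o name,creation -s creation | grep 'zfs-auto-snap_hourly' | tail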
Re: [zfs-discuss] kernel panic during zfs import [UPDATE]
Hello everybody,

just to let you know what happened in the meantime:

I was able to open a service request at Oracle. The issue is a known bug (Bug 6742788: assertion panic at: zfs:zap_deref_leaf).

The bug has been fixed (according to Oracle support) since build 164, but there is no fix for Solaris 11 available so far (will be fixed in S11U7?). There is a workaround available that works (partly), but my system crashed again when trying to rebuild the offending zfs within the affected zpool.

At the moment I'm waiting for a so-called "interim diagnostic relief" patch.

cu

Carsten

--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key: http://www.mpi-bremen.de/Carsten_John.html
Re: [zfs-discuss] kernel panic during zfs import [ORACLE should notice this]
-Original message-
To: zfs-discuss@opensolaris.org
From: John D Groenveld jdg...@elvis.arl.psu.edu
Sent: Fri 30-03-2012 21:47
Subject: Re: [zfs-discuss] kernel panic during zfs import [ORACLE should notice this]

In message 4f735451.2020...@oracle.com, Deepak Honnalli writes:

Thanks for your reply. I would love to take a look at the core file. If there is a way this can somehow be transferred to the internal cores server, I can work on the bug. I am not sure about the modalities of transferring the core file though. I will ask around and see if I can help you here.

How to Upload Data to Oracle Such as Explorer and Core Files [ID 1020199.1]

John
groenv...@acm.org

Hi John,

in the meantime I managed to open a service request at Oracle. There is a web portal, https://supportfiles.sun.com. There you can upload the files...

cu

Carsten

--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key: http://www.mpi-bremen.de/Carsten_John.html
[zfs-discuss] Puzzling problem with zfs receive exit status
-Original message-
To: zfs-discuss@opensolaris.org
From: Borja Marcos bor...@sarenet.es
Sent: Thu 29-03-2012 11:49
Subject: [zfs-discuss] Puzzling problem with zfs receive exit status

Hello,

I hope someone has an idea. I have a replication program that copies a dataset from one server to another one. The replication mechanism is the obvious one, of course:

zfs send -Ri <from snapshot(n-1)> <snapshot(n)> > file
scp file <remote machine>

(I do it this way instead of using a pipeline so that a network error won't interrupt a receive data stream)

and on the remote machine,

zfs receive -Fd pool < file

It's been working perfectly for months, no issues. However, yesterday we began to see something weird: the zfs receive being executed on the remote machine is exiting with an exit status of 1, even though the replication is finished, and I see the copied snapshots on the remote machine.

Any ideas? It's really puzzling. It seems that the replication is working (a zfs list -t snapshot shows the new snapshots correctly applied to the dataset) but I'm afraid there's some kind of corruption.

The OS is Solaris, SunOS 5.10 Generic_141445-09 i86pc i386 i86pc.

Any ideas?

Thanks in advance,

Borja.

Hi Borja,

did you try to check the snapshot file with zstreamdump? It will validate the checksums.

Perhaps the information here http://blog.richardelling.com/2009/10/check-integrity-of-zfs-send-streams.html might be useful for you.

Carsten

--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key: http://www.mpi-bremen.de/Carsten_John.html
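For anyone wanting to try the zstreamdump check: it reads the stream from stdin and verifies the embedded checksums, so with a stream saved to a file it is simply (file name is a placeholder):

zstreamdump -v < /var/tmp/replication.zfs

Checksum complaints in the output would point at a damaged stream rather than a zfs receive quirk.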
Re: [zfs-discuss] kernel panic during zfs import
-Original message-
To: ZFS Discussions zfs-discuss@opensolaris.org
From: Paul Kraus p...@kraus-haus.org
Sent: Tue 27-03-2012 15:05
Subject: Re: [zfs-discuss] kernel panic during zfs import

On Tue, Mar 27, 2012 at 3:14 AM, Carsten John cj...@mpi-bremen.de wrote:

Hallo everybody,

I have a Solaris 11 box here (Sun X4270) that crashes with a kernel panic during the import of a zpool (some 30TB) containing ~500 zfs filesystems after reboot. This causes a reboot loop, until booted single user and removed /etc/zfs/zpool.cache.

From /var/adm/messages:

savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ff002f9cec50 addr=20 occurred in module zfs due to a NULL pointer dereference
savecore: [ID 882351 auth.error] Saving compressed system crash dump in /var/crash/vmdump.2

I ran into a very similar problem with Solaris 10U9 and the replica (zfs send | zfs recv destination) of a zpool of about 25 TB of data. The problem was an incomplete snapshot (the zfs send | zfs recv had been interrupted). On boot the system was trying to import the zpool and as part of that it was trying to destroy the offending (incomplete) snapshot. This was zpool version 22 and destruction of snapshots is handled as a single TXG. The problem was that the operation was running the system out of RAM (32 GB worth).

There is a fix for this and it is in zpool 26 (or newer), but any snapshots created while the zpool is at a version prior to 26 will have the problem on-disk. We have support with Oracle and were able to get a loaner system with 128 GB RAM to clean up the zpool (it took about 75 GB RAM to do so).

If you are at zpool 26 or later this is not your problem. If you are below zpool 26, then test for an incomplete snapshot by importing the pool read only, then `zdb -d zpool | grep '%'`, as the incomplete snapshot will have a '%' instead of a '@' as the dataset / snapshot separator. You can also run the zdb against the _un_imported_ zpool using the -e option to zdb.

See the following Oracle bugs for more information:

CR# 6876953
CR# 6910767
CR# 7082249

CR# 7082249 has been marked as a duplicate of CR# 6948890.

P.S. I have a suspicion that the incomplete snapshot was also corrupt in some strange way, but could never make a solid determination of that. We think what caused the zfs send | zfs recv to be interrupted was hitting an e1000g Ethernet device driver bug.

--
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
- Technical Advisor, Troy Civic Theatre Company
- Technical Advisor, RPI Players

Hi,

this scenario seems to fit. The machine that was sending the snapshot is on OpenSolaris build 111b (which is running zpool version 14). I rebooted the receiving machine due to a hanging zfs receive that couldn't be killed.

zdb -d -e pool does not give any useful information:

zdb -d -e san_pool
Dataset san_pool [ZPL], ID 18, cr_txg 1, 36.0K, 11 objects

When importing the pool readonly, I get an error about two datasets:

zpool import -o readonly=on san_pool
cannot set property for 'san_pool/home/someuser': dataset is read-only
cannot set property for 'san_pool/home/someotheruser': dataset is read-only

As this is a mirror machine, I still have the option to destroy the pool and copy over the stuff via send/receive from the primary. But nobody knows how long this will work until I'm hit again. If an interrupted send/receive can screw up a 30TB target pool, then send/receive isn't an option for replicating data at all; furthermore, it should be flagged as "don't use it if your target pool might contain any valuable data".

I will reproduce the crash once more and try to file a bug report for S11 as recommended by Deepak (not so easy these days...).

thanks

Carsten
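A sketch of the check Paul describes, using the pool name from this thread; the '%' in a dataset name marks the partially received snapshot left behind by an interrupted zfs receive:

# import read-only so nothing tries to destroy the stale snapshot on import
zpool import -o readonly=on san_pool
zdb -d san_pool | grep '%'

# or inspect the pool without importing it at all
zdb -e -d san_pool | grep '%'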
[zfs-discuss] kernel panic during zfs import [ORACLE should notice this]
-Original message-
To: zfs-discuss@opensolaris.org
From: Deepak Honnalli deepak.honna...@oracle.com
Sent: Wed 28-03-2012 09:12
Subject: Re: [zfs-discuss] kernel panic during zfs import

Hi Carsten,

This was supposed to be fixed in build 164 of Nevada (6742788). If you are still seeing this issue in S11, I think you should raise a bug with relevant details.

As Paul has suggested, this could also be due to an incomplete snapshot. I have seen interrupted zfs recv's causing weird bugs.

Thanks,
Deepak.

Hi Deepak,

I just spent about an hour (or two) trying to file a bug report regarding the issue, without success. It seems I'm too stupid to use this MyOracleSupport portal. So, as I'm getting paid for keeping systems running and not for clicking through flash-overloaded support portals searching for CSIs, I'm giving the relevant information to the list now. Perhaps someone at Oracle reading the list is able to file a bug report, or contact me off list.

Background:

Machine A:
- Sun X4270
- OpenSolaris build 111b
- zpool version 14
- primary file server
- sending snapshots via zfs send
- direct attached Sun J4400 SAS JBODs with 40 TB storage in total

Machine B:
- Sun X4270
- Solaris 11
- zpool version 33
- mirror server
- receiving snapshots via zfs receive
- FC attached Storagetek FLX280 storage

Incident:

After a zfs send/receive run, machine B had a hanging zfs receive process. To get rid of the process, I rebooted the machine. During reboot the kernel panics, resulting in a reboot loop. To bring up the system, I rebooted single user, removed /etc/zfs/zpool.cache and rebooted again.

The damaged pool can be imported readonly, giving a warning:

$ zpool import -o readonly=on san_pool
cannot set property for 'san_pool/home/someuser': dataset is read-only
cannot set property for 'san_pool/home/someotheruser': dataset is read-only

The ZFS debugger zdb does not give any additional information:

$ zdb -d -e san_pool
Dataset san_pool [ZPL], ID 18, cr_txg 1, 36.0K, 11 objects

The issue can be reproduced by trying to import the pool r/w, resulting in a kernel panic.

The fmdump utility gives the following information for the relevant UUID:

$ fmdump -Vp -u 91da1503-74c5-67c2-b7c1-d4e245e4d968
TIME UUID SUNW-MSG-ID
Mar 28 2012 12:54:26.563203000 91da1503-74c5-67c2-b7c1-d4e245e4d968 SUNOS-8000-KL

TIME CLASS ENA
Mar 28 12:54:24.2698 ireport.os.sunos.panic.dump_available 0x
Mar 28 12:54:05.9826 ireport.os.sunos.panic.dump_pending_on_device 0x

nvlist version: 0
    version = 0x0
    class = list.suspect
    uuid = 91da1503-74c5-67c2-b7c1-d4e245e4d968
    code = SUNOS-8000-KL
    diag-time = 1332932066 541092
    de = fmd:///module/software-diagnosis
    fault-list-sz = 0x1
    __case_state = 0x1
    topo-uuid = 3b4117e0-0ac7-cde5-b434-b9735176d591
    fault-list = (array of embedded nvlists)
    (start fault-list[0])
    nvlist version: 0
        version = 0x0
        class = defect.sunos.kernel.panic
        certainty = 0x64
        asru = sw:///:path=/var/crash/.91da1503-74c5-67c2-b7c1-d4e245e4d968
        resource = sw:///:path=/var/crash/.91da1503-74c5-67c2-b7c1-d4e245e4d968
        savecore-succcess = 1
        dump-dir = /var/crash
        dump-files = vmdump.0
        os-instance-uuid = 91da1503-74c5-67c2-b7c1-d4e245e4d968
        panicstr = BAD TRAP: type=e (#pf Page fault) rp=ff002f6dcc50 addr=20 occurred in module zfs due to a NULL pointer dereference
        panicstack = unix:die+d8 () | unix:trap+152b () | unix:cmntrap+e6 () | zfs:zap_leaf_lookup_closest+45 () | zfs:fzap_cursor_retrieve+cd () | zfs:zap_cursor_retrieve+195 () | zfs:zfs_purgedir+4d () | zfs:zfs_rmnode+57 () | zfs:zfs_zinactive+b4 () | zfs:zfs_inactive+1a3 () | genunix:fop_inactive+b1 () | genunix:vn_rele+58 () | zfs:zfs_unlinked_drain+a7 () | zfs:zfsvfs_setup+f1 () | zfs:zfs_domount+152 () | zfs:zfs_mount+4e3 () | genunix:fsop_mount+22 () | genunix:domount+d2f () | genunix:mount+c0 () | genunix:syscall_ap+92 () | unix:brand_sys_sysenter+1cf () |
        crashtime = 1332931339
        panic-time = March 28, 2012 12:42:19 PM CEST CEST
    (end fault-list[0])
    fault-status = 0x1
    severity = Major
    __ttl = 0x1
    __tod = 0x4f72ede2 0x2191cbb8

The 'first view' debugger output looks like:

mdb unix.0 vmcore.0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs mpt sd ip hook neti arp usba uhci sockfs qlc fctl s1394 kssl lofs random idm sppp crypto sata fcip cpc fcp ufs logindmux ptm ]
$c
zap_leaf_lookup_closest+0x45(ff0728eac588,
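For completeness, the same panic string and stack can also be pulled from the dump non-interactively, which is convenient when attaching output to a service request (dump file names as above):

printf '::status\n::stack\n' | mdb unix.0 vmcore.0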
[zfs-discuss] zfs import from i86 to sparc
Hi everybody,

are there any problems to be expected if we try to export/import a zfs pool from OpenSolaris (Intel, zpool version 14) to Solaris 10 (SPARC, zpool version 19)?

thanks

Carsten
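For what it's worth: the ZFS on-disk format is endian-independent, so moving from x86 to SPARC is not a problem by itself, and a version 14 pool can be imported by a version 19 implementation (the reverse would not work, and once the pool is upgraded on the Solaris 10 side it cannot go back). The basic move, with a placeholder pool name, would be:

# on the OpenSolaris (x86) box
zpool export tank
# on the Solaris 10 (SPARC) box
zpool import tank
# check supported versions before deciding whether to upgrade
zpool upgrade -v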
[zfs-discuss] zfs send/receive script
Hello everybody,

I set up a script to replicate all zfs filesystems (some 300 user home directories in this case) within a given pool to a mirror machine. The basic idea is to send the snapshots incrementally if the corresponding snapshot exists on the remote side, or send a complete snapshot if no corresponding previous snapshot is available.

The setup basically works, but from time to time (within a run over all filesystems) I get error messages like:

cannot receive new filesystem stream: dataset is busy

or

cannot receive incremental filesystem stream: dataset is busy

The complete script is available under: http://pastebin.com/AWevkGAd

Does anybody have a suggestion what might cause the dataset to be busy?

thx

Carsten
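For readers who don't want to follow the pastebin link, here is a minimal sketch of the logic described above. Host, pool and snapshot names are made up, error handling is omitted, and it assumes the filesystems exist under the same names on the mirror machine:

#!/bin/bash
# replicate every home filesystem in $POOL to $REMOTE (sketch only)
POOL=san_pool
REMOTE=mirrorhost
TODAY=$(date +%Y-%m-%d)

for FS in $(zfs list -H -o name -r $POOL/home)
do
    zfs snapshot $FS@$TODAY
    # newest snapshot of this filesystem already present on the remote side
    LAST=$(ssh $REMOTE "zfs list -H -o name -t snapshot -s creation -r $FS 2>/dev/null | tail -1" | awk -F@ '{print $2}')
    if [ -n "$LAST" ]
    then
        # remote side has a previous snapshot: send the increment
        zfs send -i $FS@$LAST $FS@$TODAY | ssh $REMOTE "zfs receive -F $FS"
    else
        # no previous snapshot on the remote side: send the full stream
        zfs send $FS@$TODAY | ssh $REMOTE "zfs receive -F $FS"
    fi
done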
[zfs-discuss] snapshots in solaris11 express
Hello everybody,

is there any known way to configure the point in time *when* time-slider will snapshot/rotate?

With hundreds of zfs filesystems, the daily snapshot rotation slows down a big file server significantly, so it would be better to have the snapshots rotated outside the usual work hours.

As far as I have found out so far, the first snapshot is taken when the service is restarted and the next one occurs 24 hours later (as supposed). Do I need to restart the service at 2:00 AM to get the desired result (not a big deal with /usr/bin/at, but not as straightforward as I would expect)?

Any suggestions?

thx

Carsten

--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key: http://www.mpi-bremen.de/Carsten_John.html
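If restarting the service really is what resets the 24-hour cycle, something along these lines would shift the daily run to 02:00. The FMRI is the stock auto-snapshot instance; whether a one-off at job is enough, or whether it has to recur (e.g. from root's crontab), is an open question:

echo "svcadm restart svc:/system/filesystem/zfs/auto-snapshot:daily" | at 2:00am
# or recurring, via crontab -e for root:
# 0 2 * * * svcadm restart svc:/system/filesystem/zfs/auto-snapshot:daily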
Re: [zfs-discuss] replace zil drive
[OpenPGP-encrypted attachment only; the message was resent in the clear below.]
Re: [zfs-discuss] replace zil drive
On 06/28/11 02:55, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Carsten John

Now I'm wondering about the best option to replace the HDD with the SSD:

What version of zpool are you running? If it's >= 19, then you could actually survive a complete ZIL device failure. So you should simply offline or detach or whatever the HDD and then either attach or add the new SSD. Attach would be mirror, add would be two separate non-mirrored devices. Maybe better performance, maybe not.

If it's zpool < 19, then you absolutely do not want to degrade to non-mirrored status. First attach the new SSD, then when it's done, detach the HDD.

Sorry, sent encrypted before...

I'm currently running:

zpool upgrade -v
This system is currently running ZFS pool version 31

So, detaching the HDD seems to be a safe option.

thx

Carsten

--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key: http://www.mpi-bremen.de/Carsten_John.html
Re: [zfs-discuss] replace zil drive [SOLVED]
On 06/28/11 02:55, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Carsten John

Now I'm wondering about the best option to replace the HDD with the SSD:

What version of zpool are you running? If it's >= 19, then you could actually survive a complete ZIL device failure. So you should simply offline or detach or whatever the HDD and then either attach or add the new SSD. Attach would be mirror, add would be two separate non-mirrored devices. Maybe better performance, maybe not.

If it's zpool < 19, then you absolutely do not want to degrade to non-mirrored status. First attach the new SSD, then when it's done, detach the HDD.

Worked like a charm. Detached the HDD, physically replaced the HDD with the new SSD and added the new SSD to the pool's log.

thx for the suggestions

Carsten

--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key: http://www.mpi-bremen.de/Carsten_John.html
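In command terms, the replacement described above boils down to roughly the following; pool and device names are placeholders, and it assumes the surviving SSD stays in the log mirror the whole time:

# drop the temporary HDD out of the log mirror
zpool detach tank c4t2d0
# after physically swapping in the replacement SSD, mirror it onto the surviving log device
zpool attach tank c4t1d0 c4t3d0
zpool status tank

(zpool add tank log <device> would instead create a second, non-mirrored log device.)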
[zfs-discuss] replace zil drive
Hello everybody,

some time ago a SSD within a ZIL mirror died. As I had no SSD available to replace it, I dropped in a normal SAS harddisk to rebuild the mirror. In the meantime I got the warranty replacement SSD.

Now I'm wondering about the best option to replace the HDD with the SSD:

1. Remove the log mirror, put the new disk in place, add the log mirror again.

2. Pull the HDD, forcing the mirror to fail, and replace the HDD with the SSD.

Unfortunately I have no free slot in the JBOD available (I want to keep the ZIL in the same JBOD as the rest of the pool), so:

3. Put an additional temporary SAS HDD in a free slot of a different JBOD, replace the HDD in the ZIL mirror with the temporary HDD, pull the now unused HDD, use the free slot for the SSD, replace the temporary HDD with the SSD.

Any suggestions?

thx

Carsten

--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key: http://www.mpi-bremen.de/Carsten_John.html
Re: [zfs-discuss] pool died during scrub
Jeff Bacon wrote:

I have a bunch of sol10U8 boxes with ZFS pools, most all raidz2 8-disk stripe. They're all supermicro-based with retail LSI cards. I've noticed a tendency for things to go a little bonkers during the weekly scrub (they all scrub over the weekend), and that's when I'll lose a disk here and there. OK, fine, that's sort of the point, and they're SATA drives so things happen.

I've never lost a pool though, until now. This is Not Fun.

::status
debugging crash dump vmcore.0 (64-bit) from ny-fs4
operating system: 5.10 Generic_142901-10 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fe80007cb850 addr=28 occurred in module zfs due to a NULL pointer dereference
dump content: kernel pages only

$C
fe80007cb960 vdev_is_dead+2()
fe80007cb9a0 vdev_mirror_child_select+0x65()
fe80007cba00 vdev_mirror_io_start+0x44()
fe80007cba30 zio_vdev_io_start+0x159()
fe80007cba60 zio_execute+0x6f()
fe80007cba90 zio_wait+0x2d()
fe80007cbb40 arc_read_nolock+0x668()
fe80007cbbd0 dmu_objset_open_impl+0xcf()
fe80007cbc20 dsl_pool_open+0x4e()
fe80007cbcc0 spa_load+0x307()
fe80007cbd00 spa_open_common+0xf7()
fe80007cbd10 spa_open+0xb()
fe80007cbd30 pool_status_check+0x19()
fe80007cbd80 zfsdev_ioctl+0x1b1()
fe80007cbd90 cdev_ioctl+0x1d()
fe80007cbdb0 spec_ioctl+0x50()
fe80007cbde0 fop_ioctl+0x25()
fe80007cbec0 ioctl+0xac()
fe80007cbf10 _sys_sysenter_post_swapgs+0x14b()

pool: srv
id: 9515618289022845993
state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing devices and try again.
see: http://www.sun.com/msg/ZFS-8000-6X
config:

        srv                        UNAVAIL  missing device
          raidz2                   ONLINE
            c2t5000C5001F2CCE1Fd0  ONLINE
            c2t5000C5001F34F5FAd0  ONLINE
            c2t5000C5001F48D399d0  ONLINE
            c2t5000C5001F485EC3d0  ONLINE
            c2t5000C5001F492E42d0  ONLINE
            c2t5000C5001F48549Bd0  ONLINE
            c2t5000C5001F370919d0  ONLINE
            c2t5000C5001F484245d0  ONLINE
          raidz2                   ONLINE
            c2t5F000B5C8187d0      ONLINE
            c2t5F000B5C8157d0      ONLINE
            c2t5F000B5C9101d0      ONLINE
            c2t5F000B5C8167d0      ONLINE
            c2t5F000B5C9120d0      ONLINE
            c2t5F000B5C9151d0      ONLINE
            c2t5F000B5C9170d0      ONLINE
            c2t5F000B5C9180d0      ONLINE
          raidz2                   ONLINE
            c2t5000C50010A88E76d0  ONLINE
            c2t5000C5000DCD308Cd0  ONLINE
            c2t5000C5001F1F456Dd0  ONLINE
            c2t5000C50010920E06d0  ONLINE
            c2t5000C5001F20C81Fd0  ONLINE
            c2t5000C5001F3C7735d0  ONLINE
            c2t5000C500113BC008d0  ONLINE
            c2t5000C50014CD416Ad0  ONLINE

Additional devices are known to be part of this pool, though their exact configuration cannot be determined.

All of this would be ok... except THOSE ARE THE ONLY DEVICES THAT WERE PART OF THE POOL. How can it be missing a device that didn't exist?

A zpool import -fF results in the above kernel panic. This also creates /etc/zfs/zpool.cache.tmp, which then results in the pool being imported, which leads to a continuous reboot/panic cycle.

I can't obviously use b134 to import the pool without logs, since that would imply upgrading the pool first, which is hard to do if it's not imported. My zdb skills are lacking - zdb -l gets you about so far and that's it. (where the heck are the other options to zdb even written down, besides in the code?)

OK, so this isn't the end of the world, but it's 15TB of data I'd really rather not have to re-copy across a 100Mbit line. It really more concerns me that ZFS would do this in the first place - it's not supposed to corrupt itself!!
Hi Jeff,

looks similar to a crash I had here at our site a few months ago. Same symptoms, no actual solution. We had to recover from a rsync backup server.

We had the logs on a mirrored SSD and an additional SSD as cache. The machine (Sun X4270 with Sun J4400 JBODs and Sun SAS disks) crashed in the same manner (core dumping while trying to import the pool). After booting into single user mode we found the log pool mirror corrupted (one disk unavailable). Even after replacing the disk and resilvering the log mirror we were not able to import the pool.

I suspect that it may have been related to memory (perhaps a lack of memory).

all the best

Carsten

--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
[zfs-discuss] crashed zpool
Hello everybody,

last week we experienced a severe outage due to a crashed zpool. I'm now in the process of investigating the reason for the crash, to prevent it in the future. Maybe some of the people with more experience are able to help me.

The setup:

- Sun Fire X4270 with 16GB RAM running OpenSolaris 2009.06, acting as Samba PDC and NIS/NFS server for some 400 users.
- sas_zpool built of 24x 300GB SAS disks (4x raidz) in a JBOD, 2x 32GB SSD (mirror) for the zfs log, 1x 160GB SSD for the zfs cache
- bulk_pool containing 42x 1TB SATA/SAS disks in 2 JBODs

The machine worked for several months without a problem. A week ago we added the last set of 6 disks to the sas_pool.

What happened:

The server became unavailable; obviously it had crashed and wrote a kernel core dump. After rebooting the machine the server crashed again (core dumping) while trying to mount the zfs filesystems (home directories) from the sas_pool. We booted single user and checked the zpool status. The sas_pool was degraded with a failed SSD disk in the log mirror. We replaced the failed disk and waited until the resilvering process had finished (took some 4 hours). zpool status for the pool was fine after that. Rebooting the machine in multi user mode resulted in the same core dump as before.

Fortunately we had a rsync mirror of our home directories (a second X4270 with a bunch of SATA JBODs). We finally mounted the spare machine via NFS instead of the crashed pool to keep services running.

What might be the reason?

- the failed SSD (shouldn't harm, as it is mirrored)
- not enough RAM causing the crash, damaging the zpool

Is there any chance to reanimate the crashed pool? Otherwise we need to build the pool from scratch and rsync from the fallback (this will take several days).

Thanks in advance for any suggestions

Carsten

--
Max Planck Institut fuer marine Mikrobiologie
- Network Administration -
Celsiustr. 1
D-28359 Bremen
Tel.: +49 421 2028568
Fax.: +49 421 2028565
PGP public key: http://www.mpi-bremen.de/Carsten_John.html