Re: [zfs-discuss] Disk Issues
Ok, I changed the cable and also tried swapping the port on the motherboard. The drive continued to show huge asvc_t and also started to show huge wsvc_t. I unplugged it and the 'pool' is now performing as expected. See the 'storage' forum for any further updates, as I am now convinced this has nothing to do with ZFS or my attempt to disable the ZIL. 8-) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ZIL + L2ARC SSD Setup
Hi Daniel, On 08.02.10 05:45, Daniel Carosone wrote: On Mon, Feb 08, 2010 at 04:58:38AM +0100, Felix Buenemann wrote: I have some questions about the choice of SSDs to use for ZIL and L2ARC. I have one answer. The other questions are mostly related to your raid controller, which I can't answer directly. - Is it safe to run the L2ARC without battery backup with write cache enabled? Yes, it's just a cache, errors will be detected and re-fetched from the pool. Also, it is volatile-at-reboot (starts cold) at present anyway, so preventing data loss at power off is not worth spending any money or time over. Thanks for clarifying this. - Does it make sense to use HW RAID10 on the storage controller or would I get better performance out of JBOD + ZFS RAIDZ2? A more comparable alternative would be using the controller in jbod mode and a pool of zfs mirror vdevs. I'd expect that gives similar performance to the controller's mirroring (unless higher pci bus usage is a bottleneck) but gives you the benefits of zfs healing on disk errors. I was under the impression that using HW RAID10 would save me 50% PCI bandwidth and allow the controller to handle its cache more intelligently, so I stuck with it. But I should run some benchmarks of RAID10 vs. JBOD with ZFS mirrors to see if this makes a difference. Performance of RaidZ/5 vs mirrors is a much more workload-sensitive question, regardless of the additional implementation-specific wrinkles of either kind. Your emphasis on lots of slog and l2arc suggests performance is a priority. Whether all this kit is enough to hide the IOPS penalty of raidz/5, or whether you need it even to make mirrors perform adequately, you'll have to decide yourself. So it seems right to assume that RAIDZ1/2 has about the same performance hit as HW RAID5/6 with write cache. I wasn't aware that ZFS can do RAID10-style multiple mirrors, so that seems to be the better option anyway. -- Dan. - Felix ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
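For reference, the JBOD alternative I'd be benchmarking against would look roughly like this: a pool of striped mirrors on the data disks, with one SSD added as a dedicated log (slog) device and one as an L2ARC cache device. This is only a sketch - the device names are made up and would need to match whatever the controller exposes in JBOD mode:

zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0
zpool add tank log c2t0d0       # dedicated ZIL (slog) SSD
zpool add tank cache c2t1d0     # L2ARC SSD
zpool status tank               # the log and cache vdevs should show up separately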
[zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
Hello, an idea popped into my mind while talking about security and intrusion detection. Host-based ID may use checksumming for file change tracking. It works like this: once installed and knowing the software is OK, a baseline is created. Then on every check, verify the current state of the data against the baseline and report changes. An example of this is AIDE. The difficult part is the checksumming - it takes time. My idea would be to use ZFS snapshots for this: baseline creation = create a snapshot; baseline verification = verify the checksums of the objects and report the objects that differ. This could work for non-zvol environments. Is it possible to extract the checksums of ZFS objects with a command line tool? Regards, Robert -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
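What I have in mind, very roughly - note this sketch still reads and compares the file contents in userland via the snapshot's .zfs directory, so it does not yet get the checksums "for free" from ZFS, which is exactly the part I'm asking about (the dataset name is just an example):

zfs snapshot tank/software@baseline
# ... later, compare the live tree against the read-only snapshot copy:
cd /tank/software
find . -type f | while read f; do
    cmp -s "$f" ".zfs/snapshot/baseline/$f" || echo "CHANGED: $f"
done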
[zfs-discuss] Big send/receive hangs on 2009.06
So, I was running my full backup last night, backing up my main data pool zp1, and it seems to have hung. Any suggestions for additional data gathering?

-bash-3.2$ zpool status zp1
  pool: zp1
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zp1         ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0

errors: No known data errors

to one of my external USB drives holding pool bup-wrack

-bash-3.2$ zpool status bup-wrack
  pool: bup-wrack
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        bup-wrack    ONLINE       0     0     0
          c7t0d0     ONLINE       0     0     0

errors: No known data errors

The line in the script that starts the send and receive is

zfs send -Rv $srcsnap | zfs recv -Fudv $BUPPOOL/$HOSTNAME/$FS

And the -v causes the start and stop of each incremental stream to be announced of course. The last output from it was:

sending from @bup-20090315-190807UTC to zp1/d...@bup-20090424-034702utc
receiving incremental stream of zp1/d...@bup-20090424-034702utc into bup-wrack/fsfs/zp1/d...@bup-20090424-034702utc

And it appears hung when I got up this morning. No activity on the drive, zpool iostat shows no activity on the backup pool and no unexplained activity on the data pool. The server is responsive, and the data pool is responsive. ps shows considerable accumulated time on the backup and receive processes, but no change in the last half hour. zpool list shows that quite a lot of data has not yet been transferred to the backup pool (which was newly-created when this backup started).
-bash-3.2$ zpool list NAMESIZE USED AVAILCAP HEALTH ALTROOT bup-wrack 928G 438G 490G47% ONLINE /backups/bup-wrack rpool74G 6.35G 67.7G 8% ONLINE - zp1 744G 628G 116G84% ONLINE - ps -ef shows root 3153 3145 0 23:09:07 pts/3 19:59 zfs recv -Fudv bup-wrack/fsfs/zp1 root 3145 3130 0 23:09:04 pts/3 0:00 /bin/bash ./bup-backup-full zp1 bup-wrack root 3152 3145 0 23:09:07 pts/3 17:06 zfs send -Rv z...@bup-20100208-050907gmt zfs list shows: -bash-3.2$ zfs list -t snapshot,filesystem -r zp1 NAME USED AVAIL REFER MOUNTPOINT zp1 628G 104G 33.8M /home z...@bup-20090223-033745utc 0 - 33.8M - z...@bup-20090225-184857utc 0 - 33.8M - z...@bup-20090302-032437utc 0 - 33.8M - z...@bup-20090309-033514utc 0 - 33.8M - z...@bup-20090315-190807utc 0 - 33.8M - z...@bup-20090424-034702utc22K - 33.8M - z...@bup-20090619-063536gmt 0 - 33.8M - z...@bup-20090619-143851utc 0 - 33.8M - z...@bup-20090804-024506utc 0 - 33.8M - z...@bup-20090906-192431utc 0 - 33.8M - z...@bup-20100102-035216utc 0 - 33.8M - z...@bup-20100102-184101utc 0 - 33.8M - z...@bup-20100208-050707gmt 0 - 33.8M - z...@bup-20100208-050907gmt 0 - 33.8M - zp1/ddb 494G 104G 452G /home/ddb zp1/d...@bup-20090223-033745utc 5.12M - 326G - zp1/d...@bup-20090225-184857utc 4.15M - 328G - zp1/d...@bup-20090302-032437utc 16.6M - 329G - zp1/d...@bup-20090309-033514utc 8.95M - 330G - zp1/d...@bup-20090315-190807utc 35.3M - 330G - zp1/d...@bup-20090424-034702utc 140M - 345G - zp1/d...@bup-20090619-063536gmt 43.9M - 386G - zp1/d...@bup-20090619-143851utc 44.9M - 386G - zp1/d...@bup-20090804-024506utc 4.30G - 418G - zp1/d...@bup-20090906-192431utc 8.43G - 440G - zp1/d...@bup-20100102-035216utc 4.13G - 435G - zp1/d...@bup-20100102-184101utc 108M - 431G - zp1/d...@bup-20100208-050707gmt 142K - 452G - zp1/d...@bup-20100208-050907gmt 140K - 452G - zp1/jmf 33.5G 104G 33.3G /home/jmf zp1/j...@bup-20090223-033745utc 0 - 33.2G - zp1/j...@bup-20090225-184857utc 0 - 33.2G - zp1/j...@bup-20090302-032437utc 0 - 33.2G - zp1/j...@bup-20090309-033514utc 0 - 33.2G - zp1/j...@bup-20090315
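Unless someone has better ideas, these are the things I plan to capture next (the PIDs are the send/recv processes from the ps output above; adjust to taste):

pstack 3152                        # user-level stack of the zfs send
pstack 3153                        # user-level stack of the zfs recv
echo "::threadlist -v" | mdb -k    # kernel thread stacks, to look for a stuck txg sync or USB I/O
zpool iostat -v bup-wrack 5        # whether any I/O is still trickling into the backup pool
iostat -xn 5                       # per-device service times on the USB drive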
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
On 08/02/2010 12:55, Lutz Schumann wrote: Hello, an idea popped into my mind while talking about security and intrusion detection. Host based ID may use Checksumming for file change tracking. It works like this: Once installed and knowning the software is OK, a baseline is created. Then in every check - verify the current status of the data with the baseline and report changes. An example for this is AIDE. The difficult part is the checksumming - this takes time. My idea would be to use ZFS snapshots for this. baseline creation = create snapshot baseline verification = verify the checksums of the objects and report objects diffent This could work for non-zvol environments. Is it possible to extract the checksums of ZFS objects with a command line tool ? Only with the zdb(1M) tool but note that the checksums are NOT of files but of the ZFS blocks. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
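For example, something along these lines will show them - but zdb is not a stable interface, its output format and useful verbosity levels vary between builds, and the paths here are only illustrative:

ls -i /tank/software/some/file       # on ZFS the inode number is the object number
zdb -ddddd tank/software <object#>   # dumps that object, including its block
                                     # pointers; each blkptr line carries the
                                     # checksum algorithm and value for that block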
Re: [zfs-discuss] Recover ZFS Array after OS Crash?
On 06/02/2010 13:18, Fajar A. Nugraha wrote: On Sat, Feb 6, 2010 at 1:32 AM, Jjahservan...@gmail.com wrote: saves me hundreds on HW-based RAID controllers ^_^ ... which you might need to fork over to buy additional memory or faster CPU :P Don't get me wrong, zfs is awesome, but to do so it needs more CPU power and RAM (and possibly SSD) compared to other filesystems. If your main concern is cost, then some HW raid controller might be more effective. Any real data to back your claims? Then you need to be realistic - if ZFS consumes, let's say, 10-30% more CPU but can still do several GB/s (assuming your storage can handle it) on a modern x86 box, then for the 99% of use cases where *much* less data is actually handled by a filesystem in real workloads, the difference in CPU usage is negligible. This is even more so for fileservers (as in the OP's case), where the box is usually dedicated to file serving only. In real life, in most environments, ZFS or not, the lvm/fs layer consumes much less than 10% of your CPU on an entry-level x86 server, and if ZFS consumed a little bit more it wouldn't really matter. For example, IIRC an old x4500 (older AMD CPUs) can do about 2GB/s of sustained throughput with ZFS while still not saturating its CPUs. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ZIL + L2ARC SSD Setup
On Mon, 8 Feb 2010, Felix Buenemann wrote: I was under the impression that using HW RAID10 would save me 50% PCI bandwidth and allow the controller to handle its cache more intelligently, so I stuck with it. But I should run some benchmarks in RAID10 vs. JBOD with ZFS mirrors to see if this makes a difference. The answer to this is "it depends". If the PCI-E and controller have enough bandwidth capacity, then the write bottleneck will be the disk itself. If there is insufficient controller bandwidth capacity, then the controller becomes the bottleneck. If the bottleneck is the disks, then there is hardly any write penalty from using zfs mirrors. If the bottleneck is the controller, then you may see 1/2 the write performance due to using zfs mirrors. If you are using modern computing hardware, then the disks should be the bottleneck. Performance of HW RAID controllers is a complete unknown, and they tend to store the data in a controller-specific on-disk format, which really sucks if the controller fails. It is usually better to run the controller in a JBOD mode (taking advantage of its write cache, if available) and use zfs mirrors. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
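If you want a very rough feel for which side is the bottleneck before committing to a layout, something like the following can help. Treat it only as a sketch: the device name is made up, writing to a raw disk destroys its contents (scratch disks only!), and the pool test is inflated by caching unless the file is much larger than RAM:

dd if=/dev/zero of=/dev/rdsk/c1t0d0s0 bs=1024k count=1024   # raw single-disk ceiling
dd if=/dev/zero of=/tank/bigfile bs=1024k count=16384       # through the pool
zpool iostat -v tank 5                                       # per-vdev view while it runs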
[zfs-discuss] zfs send/receive : panic and reboot
copied from opensolaris-dicuss as this probably belongs here. I kept on trying to migrate my pool with children (see previous threads) and had the (bad) idea to try the -d option on the receive part. The system reboots immediately. Here is the log in /var/adm/messages Feb 8 16:07:09 amber unix: [ID 836849 kern.notice] Feb 8 16:07:09 amber ^Mpanic[cpu1]/thread=ff014ba86e40: Feb 8 16:07:09 amber genunix: [ID 169834 kern.notice] avl_find() succeeded inside avl_add() Feb 8 16:07:09 amber unix: [ID 10 kern.notice] Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4660 genunix:avl_add+59 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c46c0 zfs:find_ds_by_guid+b9 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c46f0 zfs:findfunc+23 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c47d0 zfs:dmu_objset_find_spa+38c () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4810 zfs:dmu_objset_find+40 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4a70 zfs:dmu_recv_stream+448 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4c40 zfs:zfs_ioc_recv+41d () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4cc0 zfs:zfsdev_ioctl+175 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4d00 genunix:cdev_ioctl+45 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4d40 specfs:spec_ioctl+5a () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4dc0 genunix:fop_ioctl+7b () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4ec0 genunix:ioctl+18e () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4f10 unix:brand_sys_syscall32+1ca () Feb 8 16:07:09 amber unix: [ID 10 kern.notice] Feb 8 16:07:09 amber genunix: [ID 672855 kern.notice] syncing file systems... Feb 8 16:07:09 amber genunix: [ID 904073 kern.notice] done Feb 8 16:07:10 amber genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel Feb 8 16:07:10 amber ahci: [ID 405573 kern.info] NOTICE: ahci0: ahci_tran_reset_dport port 3 reset port Feb 8 16:07:35 amber genunix: [ID 10 kern.notice] Feb 8 16:07:35 amber genunix: [ID 665016 kern.notice] ^M100% done: 107693 pages dumped, Feb 8 16:07:35 amber genunix: [ID 851671 kern.notice] dump succeeded -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] f20 x4540
Hi, Officially it's not supported (yet?). Has anyone tried it with x4540 though? -- Robert Milkowski http://milek.blogspot.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Drive failure causes system to be unusable
Hi! I have an OSOL box as a home file server. It has 4 1TB USB drives and a 1TB FW drive attached. The USB devices are combined into a RaidZ pool and the FW drive acts as a hot spare. This night, one USB drive faulted and the following happened: 1. the zpool was not accessible anymore 2. changing to a directory on the pool caused the tty to get stuck 3. no reboot was possible 4. the system had to be rebooted ungracefully by pushing the power button. After reboot: 1. the zpool ran in a degraded state 2. the spare device did NOT automatically go online 3. the system did not boot to the usual run level, no auto-boot zones were started, and GDM did not start either.

        NAME         STATE     READ WRITE CKSUM
        tank         DEGRADED     0     0     0
          raidz1-0   DEGRADED     0     0     0
            c21t0d0  ONLINE       0     0     0
            c22t0d0  ONLINE       0     0     0
            c20t0d0  FAULTED      0     0     0  corrupted data
            c23t0d0  ONLINE       0     0     0
        cache
          c18t0d0    ONLINE       0     0     0
        spares
          c16t0d0    AVAIL

My questions: 1. Why does the system get stuck when a device faults? 2. Why does the hot spare not go online? (The manual says that going online automatically is the default behavior.) 3. Why does the system not boot to the usual run level when a zpool is in a degraded state at boot time? Regards, Martin ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mounting a snapshot of an iSCSI volume using Windows
Thanks Dan. When I try the clone then import: pfexec zfs clone data01/san/gallardo/g...@zfs-auto-snap:monthly-2009-12-01-00:00 data01/san/gallardo/g-testandlab pfexec sbdadm import-lu /dev/zvol/rdsk/data01/san/gallardo/g-testandlab The sbdadm import-lu gives me: sbdadm: guid in use which makes sense, now that I see it. The man pages make it look like I cannot give it another GUID during the import. Any other thoughts? I *could* delete the current lu, import, get my data off and reverse the process, but that would take the current volume off line, which is not what I want to do. Thanks, Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send/receive : panic and reboot
Can you please send a complete list of the actions taken: The commands you used to create the send stream, the commands used to receive the stream. Also the output of `zfs list -t all` on both the sending and receiving sides. If you were able to collect a core dump (it should be in /var/crash/hostname), it would be good to upload it. The panic you're seeing is in the code that is specific to receiving a dedup'ed stream. It's possible that you could do the migration if you turned off dedup (i.e. didn't specify -D) when creating the send stream.. However, then we wouldn't be able to diagnose and fix what appears to be a bug. The best way to get us the crash dump is to upload it here: https://supportfiles.sun.com/upload We need either both vmcore.X and unix.X OR you can just send us vmdump.X. Sometimes big uploads have mixed results, so if there is a problem some helpful hints are on http://wikis.sun.com/display/supportfiles/Sun+Support+Files+-+Help+and+Users+Guide, specifically in section 7. It's best to include your name or your initials or something in the name of the file you upload. As you might imagine we get a lot of files uploaded named vmcore.1 You might also create a defect report at http://defect.opensolaris.org/bz/ Lori On 02/08/10 09:41, Bruno Damour wrote: copied from opensolaris-dicuss as this probably belongs here. I kept on trying to migrate my pool with children (see previous threads) and had the (bad) idea to try the -d option on the receive part. The system reboots immediately. Here is the log in /var/adm/messages Feb 8 16:07:09 amber unix: [ID 836849 kern.notice] Feb 8 16:07:09 amber ^Mpanic[cpu1]/thread=ff014ba86e40: Feb 8 16:07:09 amber genunix: [ID 169834 kern.notice] avl_find() succeeded inside avl_add() Feb 8 16:07:09 amber unix: [ID 10 kern.notice] Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4660 genunix:avl_add+59 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c46c0 zfs:find_ds_by_guid+b9 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c46f0 zfs:findfunc+23 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c47d0 zfs:dmu_objset_find_spa+38c () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4810 zfs:dmu_objset_find+40 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4a70 zfs:dmu_recv_stream+448 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4c40 zfs:zfs_ioc_recv+41d () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4cc0 zfs:zfsdev_ioctl+175 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4d00 genunix:cdev_ioctl+45 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4d40 specfs:spec_ioctl+5a () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4dc0 genunix:fop_ioctl+7b () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4ec0 genunix:ioctl+18e () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4f10 unix:brand_sys_syscall32+1ca () Feb 8 16:07:09 amber unix: [ID 10 kern.notice] Feb 8 16:07:09 amber genunix: [ID 672855 kern.notice] syncing file systems... 
Feb 8 16:07:09 amber genunix: [ID 904073 kern.notice] done Feb 8 16:07:10 amber genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel Feb 8 16:07:10 amber ahci: [ID 405573 kern.info] NOTICE: ahci0: ahci_tran_reset_dport port 3 reset port Feb 8 16:07:35 amber genunix: [ID 10 kern.notice] Feb 8 16:07:35 amber genunix: [ID 665016 kern.notice] ^M100% done: 107693 pages dumped, Feb 8 16:07:35 amber genunix: [ID 851671 kern.notice] dump succeeded ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
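To follow up on the dedup point above, the retry would look something like this - the pool/dataset names are placeholders for whatever you used before, and the important part is simply omitting -D when creating the stream (and -d on the receive):

zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs recv -Fu newpool/migrated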
Re: [zfs-discuss] Mounting a snapshot of an iSCSI volume using Windows
Use create-lu to give the clone a different GUID: sbdadm create-lu /dev/zvol/rdsk/data01/san/gallardo/g-testandlab -- Dave On 2/8/10 10:34 AM, Scott Meilicke wrote: Thanks Dan. When I try the clone then import: pfexec zfs clone data01/san/gallardo/g...@zfs-auto-snap:monthly-2009-12-01-00:00 data01/san/gallardo/g-testandlab pfexec sbdadm import-lu /dev/zvol/rdsk/data01/san/gallardo/g-testandlab The sbdadm import-lu gives me: sbdadm: guid in use which makes sense, now that I see it. The man pages make it look like I cannot give it another GUID during the import. Any other thoughts? I *could* delete the current lu, import, get my data off and reverse the process, but that would take the current volume off line, which is not what I want to do. Thanks, Scott ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
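So the whole sequence would look something like this - the snapshot name and view details are placeholders, and sbdadm list-lu will show you the GUID the new LU was given:

zfs clone data01/san/gallardo/<volume>@<some-snapshot> data01/san/gallardo/g-testandlab
sbdadm create-lu /dev/zvol/rdsk/data01/san/gallardo/g-testandlab
sbdadm list-lu                     # note the GUID assigned to the new LU
stmfadm add-view <new-lu-guid>     # expose it to your initiators as usual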
Re: [zfs-discuss] zfs send/receive : panic and reboot
Lori Alt wrote: Can you please send a complete list of the actions taken: The commands you used to create the send stream, the commands used to receive the stream. Also the output of `zfs list -t all` on both the sending and receiving sides. If you were able to collect a core dump (it should be in /var/crash/hostname), it would be good to upload it. If it does not exist, just create it mkdir -p /var/crash/`uname -n` and then run 'savecore'. The panic you're seeing is in the code that is specific to receiving a dedup'ed stream. It's possible that you could do the migration if you turned off dedup (i.e. didn't specify -D) when creating the send stream.. However, then we wouldn't be able to diagnose and fix what appears to be a bug. The best way to get us the crash dump is to upload it here: https://supportfiles.sun.com/upload We need either both vmcore.X and unix.X OR you can just send us vmdump.X. Sometimes big uploads have mixed results, so if there is a problem some helpful hints are on http://wikis.sun.com/display/supportfiles/Sun+Support+Files+-+Help+and+Users+Guide, specifically in section 7. You may consider compressing vmdump.X further with e.g. 7z archiver 7z a vmdump.X.7z vmdump.X It's best to include your name or your initials or something in the name of the file you upload. As you might imagine we get a lot of files uploaded named vmcore.1 You might also create a defect report at http://defect.opensolaris.org/bz/ Lori On 02/08/10 09:41, Bruno Damour wrote: copied from opensolaris-dicuss as this probably belongs here. I kept on trying to migrate my pool with children (see previous threads) and had the (bad) idea to try the -d option on the receive part. The system reboots immediately. Here is the log in /var/adm/messages Feb 8 16:07:09 amber unix: [ID 836849 kern.notice] Feb 8 16:07:09 amber ^Mpanic[cpu1]/thread=ff014ba86e40: Feb 8 16:07:09 amber genunix: [ID 169834 kern.notice] avl_find() succeeded inside avl_add() Feb 8 16:07:09 amber unix: [ID 10 kern.notice] Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4660 genunix:avl_add+59 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c46c0 zfs:find_ds_by_guid+b9 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c46f0 zfs:findfunc+23 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c47d0 zfs:dmu_objset_find_spa+38c () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4810 zfs:dmu_objset_find+40 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4a70 zfs:dmu_recv_stream+448 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4c40 zfs:zfs_ioc_recv+41d () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4cc0 zfs:zfsdev_ioctl+175 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4d00 genunix:cdev_ioctl+45 () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4d40 specfs:spec_ioctl+5a () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4dc0 genunix:fop_ioctl+7b () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4ec0 genunix:ioctl+18e () Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] ff00053c4f10 unix:brand_sys_syscall32+1ca () Feb 8 16:07:09 amber unix: [ID 10 kern.notice] Feb 8 16:07:09 amber genunix: [ID 672855 kern.notice] syncing file systems... 
Feb 8 16:07:09 amber genunix: [ID 904073 kern.notice] done Feb 8 16:07:10 amber genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel Feb 8 16:07:10 amber ahci: [ID 405573 kern.info] NOTICE: ahci0: ahci_tran_reset_dport port 3 reset port Feb 8 16:07:35 amber genunix: [ID 10 kern.notice] Feb 8 16:07:35 amber genunix: [ID 665016 kern.notice] ^M100% done: 107693 pages dumped, Feb 8 16:07:35 amber genunix: [ID 851671 kern.notice] dump succeeded ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mounting a snapshot of an iSCSI volume using Windows
Sure, but that will put me back into the original situation. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ZIL + L2ARC SSD Setup
To add to Bob's notes... On Feb 8, 2010, at 8:37 AM, Bob Friesenhahn wrote: On Mon, 8 Feb 2010, Felix Buenemann wrote: I was under the impression, that using HW RAID10 would save me 50% PCI bandwidth and allow the controller to more intelligently handle its cache, so I sticked with it. But I should run some benchmarks in RAID10 vs. JBOD with ZFS mirrors to see if this makes a difference. The answer to this is it depends. If the PCI-E and controller have enough bandwidth capacity, then the write bottleneck will be the disk itself. If you have HDDs, the write bandwidth bottleneck will be the disk. If there is insufficient controller bandwidth capacity, then the controller becomes the bottleneck. We don't tend to see this for HDDs, but SSDs can crush a controller and channel. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Drive failure causes system to be unusable
On Feb 8, 2010, at 9:05 AM, Martin Mundschenk wrote: Hi! I have a OSOL box as a home file server. It has 4 1TB USB Drives and 1 TB FW-Drive attached. The USB devices are combined to a RaidZ-Pool and the FW Drive acts as a hot spare. This night, one USB drive faulted and the following happened: 1. The zpool was not accessible anymore 2. changing to a directory on the pool causes the tty to get stuck 3. no reboot was possible 4. the system had to be rebooted ungracefully by pushing the power button After reboot: 1. The zpool ran in a degraded state 2. the spare device did NOT automatically go online 3. the system did not boot to the usual run level, and no auto-boot zones where started, GDM did not start either NAME STATE READ WRITE CKSUM tank DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 c21t0d0 ONLINE 0 0 0 c22t0d0 ONLINE 0 0 0 c20t0d0 FAULTED 0 0 0 corrupted data c23t0d0 ONLINE 0 0 0 cache c18t0d0ONLINE 0 0 0 spares c16t0d0AVAIL My questions: 1. Why does the system get stuck, when a device faults? Are you sure there is not another fault here? What does svcs -xv show? -- richard 2. Why does the hot spare not go online? (The manual says, that going online automatically is the default behavior) 3. Why does the system not boot to the usual run level, when a zpool is in a degraded state at boot time? Regards, Martin ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
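Other things probably worth capturing while it's in this state (pool and device names taken from your zpool status above; note that autoreplace only governs same-slot replacement, spare activation is handled by the FMA retire agent):

zpool get autoreplace tank            # current pool settings
fmdump -eV | tail -40                 # what fault events FMA actually logged for the USB disk
zpool status -xv tank                 # what the pool itself reports now
zpool replace tank c20t0d0 c16t0d0    # manual fallback: pull in the spare by hand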
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
Only with the zdb(1M) tool but note that the checksums are NOT of files but of the ZFS blocks. Thanks - blocks, right (doh) - that's what I was missing. Damn, it would be so nice :( -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mounting a snapshot of an iSCSI volume using Windows
Ah, I didn't see the original post. If you're using an old COMSTAR version prior to build 115, maybe the metadata placed at the first 64K of the volume is causing problems? http://mail.opensolaris.org/pipermail/storage-discuss/2009-September/007192.html The clone and create-lu process works for mounting cloned volumes under linux with b130. I don't have any windows clients to test with. -- Dave On 2/8/10 11:23 AM, Scott Meilicke wrote: Sure, but that will put me back into the original situation. -Scott ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mounting a snapshot of an iSCSI volume using Windows
That is likely it. I created the volume using 2009.06, then later upgraded to 124. I just now created a new zvol, connected it to my Windows server, formatted it, and added some data. Then I snapped the zvol, cloned the snap, and used 'pfexec sbdadm create-lu'. When presented to the Windows server, it behaved as expected: I could see the data I created prior to the snapshot. Thank you very much Dave (and everyone else). Now, -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS 'secure erase'
nw == Nicolas Williams nicolas.willi...@sun.com writes: ch == c hanover chano...@umich.edu writes: Trying again: ch In our particular case, there won't be ch snapshots of destroyed filesystems (I create the snapshots, ch and destroy them with the filesystem). Right, but if your zpool is above a zvol vdev (ex COMSTAR on another box), then someone might take a snapshot of the encrypted zvol. Then after you ``securely delete'' a filesystem by overwriting various intermediate keys or whatever, they might roll back the zvol snapshot to undelete. Yes, you still need the passphrase to reach what they've undeleted, but that's always true---what's ``secure delete'' supposed to mean besides the ability to permanently remove one dataset but not others, even from those who posess the passphrase? Otherwise it would not be a feature. It would just be a suggestion: ``forget your passphrase.'' nw ZFS crypto over zvols and what not presents no additional nw problems. If you are counting on the ability to forget a key by overwriting the block of vdev in which the key's stored, then doing it over zvol's is an additional problem. but for SSD, Even if you do not have snapshots, SSD's are CoW internally so they have something like latent snapshots from an attacker's perspective. That is the point of my zvol example, which you are losing in your ``zvol's are just like devices, that's `abstraction,' I don't have to think about it.'' ex., if your data lifecycle includes the idea that the ZFS crypto user will securely delete things from devices before sent back for warranty repair or reallocated to another group, whether it's SAN LUN's or SSD's or zvol's or anythnig that has a copy-on-write character, then from now on there is no such thing as overwriting. There is only forgetting passphrases. This is both the case for using crypto in the first place (overwriting blocks is no longer useful. Devices no longer offer any command that can really erase them.), and also the limitation of any ``Secure delete'' feature. Example, my Chinese friend gives me a USB token and tells me the passphrase. It has 'blah' stored on it. I create zfs filesystem 'blergh', write secret stuff to it, then ``securely delete'' it. I return the token to my friend without fear the contents of 'blergh' could escape because you've promised I've ``securely'' deleted it. He takes the token to its manufacturer, loads diagnostic firmware, rolls back the USB key to an earlier state using its CoW wear-leveling feature, and recovers the ``securely deleted'' dataset. so in these cases ``secure delete'' is meaningless. USB tokens are common, and I don't know what is the use case of a ``secure delete'' feature rather than simply ``using passphrasese'', if not this one. zfs crypto overall is not meaningless, but it depends on the passphrase and is granular at whatever is protected by that passphrase, no smaller, once CoW underneath. If you have, (1) ability to change the passphrase whenver you like, and (2) the passphrase can be not just a string a user types but it can include a block of data read off a token, like LUKS, then with a little bit of care you can have back secure erase over CoW backing store. It depends on your ability to securely destroy the old block of key material on this token when you change the passphrase and be sure no one's saved an old copy of it. That's what I meant by keystore outside the vdev structure. 
Another scenario requiring something like secure delete which is complicated by SSD's and zvol's underneath is to protect laptops crossing borders. You may wish to make known that you routinely revoke owners' access to their laptop drives prevent customs agents from trying to harrass/detain people into handing over their passphrases. You might do this by changing the passphrase before travel then delivering the new passphrase to yourself over VPN once you've passed customs. Then, you can safely give the old passphrase, which is all you know. If the laptop contains an SSD then the old passphrase is probably still useful to a customs agent who can extract dirty blocks from beneath the SSD-fs, so you lose. For the second scenario, the holy grail feature would be to have two zpools on one vdev, encrypted with different keys. zpool A will have a 'balloon' dataset reserving the blocks used by zpool B. zpool A will have an encrypted ueberblock free of magic numbers and in a nonstandard location, and will be the one with secret data on it. zpool B will be normal and contain no holygrail features, should be preloaded by you with an earlier snapshot of zpool A. If you were to start using zpool B it would quietly overwrite and corrupt parts of zpool A. so, the process would be: zpool create B ssd load with nonsecret but bootable stuff zpool export B zpool create -holygrailfeature -o tokenfile=/tmp/A-token A ssd automatically makes balloon dataset reserving used blocks of B possibly stores
Re: [zfs-discuss] Pool import with failed ZIL device now possible ?
ck == Christo Kutrovsky kutrov...@pythian.com writes: djm == Darren J Moffat darr...@opensolaris.org writes: kth == Kjetil Torgrim Homme kjeti...@linpro.no writes: ck The never turn off the ZIL sounds scary, but if the only ck consequences are 15 (even 45) seconds of data loss .. i am ck willing to take this for my home environment. djm You have done a risk analysis and if you are happy that your djm NTFS filesystems could be corrupt on those ZFS ZVOLs if you djm lose data then you could consider turning off the ZIL. yeah I wonder if this might have more to do with write coalescing and reordering within the virtualizing package's userland, though? Disabling ZIL-writing should still cause ZVOL's to recover to a crash-consistent state: so long as the NTFS was stored on a single zvol it should not become corrupt. It just might be older than you might like, right? I'm not sure it's working as well as that, just saying it's probably not disabling the ZIL that's causing whatever problems people have with guest NTFS's, right? also, you can always rollback the zvol to the latest snapshot and uncorrupt the NTFS. so this NEVER is probably too strong. especially because ZFS recovers to txg's, the need for fsync() by certain applications is actually less than it is on other filesystems that lack that characteristic and need to use fsync() as a barrier. seems silly not to exploit this. I mean, there is no guarantee writes will be executed in order, so in theory, one could corrupt it's NTFS file system. kth I think you have that guarantee, actually. +1, at least from ZFS I think you have it. It'll recover to a txg commit which is a crash-consistent point-in-time snapshot w.r.t. to when the writes were submitted to it. so as long as they aren't being reordered by something above ZFS... kth I think you need to reboot the client so that its RAM cache is kth cleared before any other writes are made. yeah it needs to understand the filesystem was force-unmounted, and the only way to tell it so is to yank the virtual cord. djm For what it's worth I personally run with the ZIL disabled on djm my home NAS system which is serving over NFS and CIFS to djm various clients, but I wouldn't recommend it to anyone. The djm reason I say never to turn off the ZIL is because in most djm environments outside of home usage it just isn't worth the djm risk to do so (not even for a small business). yeah ok but IMHO you are getting way too much up in other people's business, assuming things about them, by saying this. these dire warnings of NEVER are probably what's led to this recurring myth that disabling ZIL-writing can lead to pool corruption when it can't. pgpI9mKkUHVuo.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
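(For the record, since "turn off the ZIL" keeps coming up in this thread: the way people actually flip it on builds of this era is the global zil_disable tunable - a blunt, system-wide switch, not per-dataset, and datasets have to be remounted or the pool re-imported for a live change to take effect. The zvol name in the rollback line is just an example.)

# in /etc/system, then reboot:
set zfs:zil_disable = 1

# or live with mdb, then remount/re-import:
echo zil_disable/W0t1 | mdb -kw

# and the rollback mentioned above, if a guest NTFS does come back inconsistent:
zfs rollback tank/ntfsvol@known-good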
Re: [zfs-discuss] Cores vs. Speed?
enh == Edward Ned Harvey sola...@nedharvey.com writes: enh As for mac access via nfs, automounter, etc ... I found that enh the UID/GID / posix permission bits were a problem, and I enh found it was easier and more reliable for the macs to use SMB I found it much less reliable, if by reliable you mean not losing data. There's a questionable GUI feature that throws up a [Disconnect] window whenever a normal unix system would say 'not responding still trying', but so long as you ignore this window instead of pressing what seems to be the only button, the old Unix feature of ``server can reboot without losing client writes'' seems to still be there. SMB, not so much. There's also questions of case sensitivity, locking, being mounted at boot time rather than login time, accomodating more than one user. I've also heard SMB is far slower. The Macs I've switched to automounted NFS are causing me less trouble. If you are in a ``share almost everything'' situation, just add umask 000 to /etc/launchd.conf and reboot. pgpQnaWJ6VGUM.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
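Concretely, the client side amounts to something like this - hostname and paths are invented for the example, and the resvport option only matters if the server insists on reserved ports:

echo "umask 000" >> /etc/launchd.conf        # then reboot the Mac
# /etc/auto_master gets a direct-map line:
#   /-   auto_direct
# and /etc/auto_direct contains entries like:
#   /Volumes/projects   -fstype=nfs,rw,resvport   nfs-server:/export/projects
automount -vc                                # reload the maps without rebooting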
Re: [zfs-discuss] Mounting a snapshot of an iSCSI volume using Windows
I plan on filing a support request with Sun, and will try to post back with any results. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Install/boot OS from ZFS iscsi target
Is it possible to install and boot MS Windows 7 from a ZFS iSCSI target? What about Linux or even Solaris? Do the installation DVDs of these OSes have sufficient drivers to install onto an iSCSI target? Please share if there is a document available. Thanks, -- Amer Ather Senior Staff Engineer Solaris Kernel Global Services Delivery amer.at...@sun.com 408-276-9780 (x19780) If you fail to prepare, prepare to fail ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
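On the target side, carving out and exporting the LUN with COMSTAR would look roughly like this (sizes and names are placeholders); the open question is whether the installers can then see and boot from it, e.g. via an iSCSI-capable NIC BIOS or gPXE:

zfs create -V 40g tank/iscsi/win7boot
svcadm enable -r svc:/network/iscsi/target:default
itadm create-target
sbdadm create-lu /dev/zvol/rdsk/tank/iscsi/win7boot
stmfadm add-view <lu-guid-from-create-lu>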
Re: [zfs-discuss] ZFS ZIL + L2ARC SSD Setup
On Mon, 8 Feb 2010, Richard Elling wrote: If there is insufficient controller bandwidth capacity, then the controller becomes the bottleneck. We don't tend to see this for HDDs, but SSDs can crush a controller and channel. It is definitely seen with older PCI hardware. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zpool list size
Hi, This may well have been covered before but I've not been able to find an answer to this particular question. I've set up a raidz2 test env using files like this:

# mkfile 1g t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 s1 s2
# zpool create dataPool raidz2 /xvm/t1 /xvm/t2 /xvm/t3 /xvm/t4 /xvm/t5
# zpool add dataPool raidz2 /xvm/t6 /xvm/t7 /xvm/t8 /xvm/t9 /xvm/t10
# zpool add dataPool spare /xvm/s1 /xvm/s2
# zpool status dataPool
  pool: dataPool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        dataPool      ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            /xvm/t1   ONLINE       0     0     0
            /xvm/t2   ONLINE       0     0     0
            /xvm/t3   ONLINE       0     0     0
            /xvm/t4   ONLINE       0     0     0
            /xvm/t5   ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
            /xvm/t6   ONLINE       0     0     0
            /xvm/t7   ONLINE       0     0     0
            /xvm/t8   ONLINE       0     0     0
            /xvm/t9   ONLINE       0     0     0
            /xvm/t10  ONLINE       0     0     0
        spares
          /xvm/s1     AVAIL
          /xvm/s2     AVAIL

All is good and it works. I then copied a few gigs of data onto the pool and checked with zpool list:

r...@vmstor01:/# zpool list
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
dataPool  9.94G  4.89G  5.04G    49%  1.00x  ONLINE  -

Now here's what I don't get: why does it say the pool size is 9.94G when it's made up of 2 x raidz2 consisting of 1G volumes? It should only be 6G, which df -h also reports correctly. For a RAIDZ2 pool I find this information - the fact that it's 9.94G and not 5.9G - completely useless and misleading; why is parity part of the calculation? Also ALLOC seems wrong, there's nothing in the pool except a full copy of /usr (just to fill up with test data), it does however correctly display that I've used about 50% of the pool. This is a build 131 machine btw.

r...@vmstor01:/# df -h /dataPool
Filesystem            Size  Used Avail Use% Mounted on
dataPool              5.9G  3.0G  3.0G  51% /dataPool

Cheers, - Lasse ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool list size
This is a FAQ, but the FAQ is not well maintained :-( http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq On Feb 8, 2010, at 1:35 PM, Lasse Osterild wrote: Hi, This may well have been covered before but I've not been able to find an answer to this particular question. I've setup a raidz2 test env using files like this: # mkfile 1g t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 s1 s2 # zpool create dataPool raidz2 /xvm/t1 /xvm/t2 /xvm/t3 /xvm/t4 /xvm/t5 # zpool add dataPool raidz2 /xvm/t6 /xvm/t7 /xvm/t8 /xvm/t9 /xvm/t10 # zpool add dataPool spare /xvm/s1 /xvm/s2 # zpool status dataPool pool: dataPool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM dataPool ONLINE 0 0 0 raidz2-0ONLINE 0 0 0 /xvm/t1 ONLINE 0 0 0 /xvm/t2 ONLINE 0 0 0 /xvm/t3 ONLINE 0 0 0 /xvm/t4 ONLINE 0 0 0 /xvm/t5 ONLINE 0 0 0 raidz2-1ONLINE 0 0 0 /xvm/t6 ONLINE 0 0 0 /xvm/t7 ONLINE 0 0 0 /xvm/t8 ONLINE 0 0 0 /xvm/t9 ONLINE 0 0 0 /xvm/t10 ONLINE 0 0 0 spares /xvm/s1 AVAIL /xvm/s2 AVAIL All is good and it works, I then copied a few gigs of data onto the pool and checked with zpool list r...@vmstor01:/# zpool list NAME SIZE ALLOC FREECAP DEDUP HEALTH ALTROOT dataPool 9.94G 4.89G 5.04G49% 1.00x ONLINE - Now here's what I don't get, why does it say the poo sizel is 9.94G when it's made up of 2 x raidz2 consisting of 1G volumes, it should only be 6G which df -h also reports correctly. No, zpool displays the available pool space. df -h displays something else entirely. If you have 10 1GB vdevs, then the total available pool space is 10GB. From the zpool(1m) man page: ... size Total size of the storage pool. These space usage properties report actual physical space available to the storage pool. The physical space can be different from the total amount of space that any contained datasets can actually use. The amount of space used in a raidz configuration depends on the characteristics of the data being written. In addition, ZFS reserves some space for internal accounting that the zfs(1M) command takes into account, but the zpool command does not. For non-full pools of a reasonable size, these effects should be invisible. For small pools, or pools that are close to being completely full, these discrepancies may become more noticeable. ... -- richard For a RAIDZ2 pool I find the information, the fact that it's 9.94G and not 5.9G, completely useless and misleading, why is parity part of the calculation? Also ALLOC seems wrong, there's nothing in the pool except a full copy of /usr (just to fill up with test data), it does however correctly display that I've used about 50% of the pool. This is a build 131 machine btw. r...@vmstor01:/# df -h /dataPool FilesystemSize Used Avail Use% Mounted on dataPool 5.9G 3.0G 3.0G 51% /dataPool Cheers, - Lasse ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] NFS access by OSX clients (was Cores vs. Speed?)
There's also questions of case sensitivity, locking, being mounted at boot time rather than login time, accomodating more than one user. I've also heard SMB is far slower. The Macs I've switched to automounted NFS are causing me less trouble. If you are in a ``share almost everything'' situation, just add umask 000 to /etc/launchd.conf and reboot. How are you managing UID's on the NFS server? If user eharvey connects to server from client Mac A, or Mac B, or Windows 1, or Windows 2, or any of the linux machines ... the server has to know it's eharvey, and assign the correct UID's etc. When I did this in the past, I maintained a list of users in AD, and duplicate list of users in OD, so the mac clients could resolve names to UID's via OD. And a third duplicate list in NIS so the linux clients could resolve. It was terrible. You must be doing something better? How do you manage your NFS exports? Do all the clients have static assigned IP's, or do you simply export to the whole subnet, or do you do something else? I would consider it a security risk, if any schmo could take any unused IP address, connect to the server, and claim to be eharvey without any problem. Also, I had a umask problem, which presumably you've got solved by the launchd.conf edit. Presumably this umask applies, whether you create a folder in Finder, or create a file in MS Word, or save a new text file from TextEdit ... The umask is applied to every file and every folder creation, regardless of which app is doing the creation, right? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] L2ARC in Cluster is picked up althought not part of the pool
On Mon, Feb 01, 2010 at 12:22:55PM -0800, Lutz Schumann wrote: Created a pool on head1 containing just the cache device (c0t0d0). This is not possible, unless there is a bug. You cannot create a pool with only a cache device. I have verified this on b131:

# zpool create norealpool cache /dev/ramdisk/rc1
invalid vdev specification: at least one toplevel vdev must be specified

This is also consistent with the notion that cache devices are auxiliary devices and do not have pool configuration information in the label. Sorry for the confusion ... a little misunderstanding. I created a pool whose only data disk is the disk formerly used as the cache device in the pool that switched. Then I exported this pool made from just a single disk (the data disk), and switched back. The exported pool was picked up as a cache device ... this seems really problematic. This is exactly the scenario I was concerned about earlier in the thread. Thanks for confirming that it occurs. Please verify that the pool had autoreplace=off (just to avoid that distraction), and file a bug. Cache devices should not automatically destroy disk contents based solely on device path, especially where that device path came along with a pool import. Cache devices need labels to confirm their identity. This is irrespective of whether the cache contents after the label are persistent or volatile, i.e. it should be fixed without waiting for the CR about persistent L2ARC. -- Dan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
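Before filing it, it is probably also worth capturing something like the following (pool and device names as in your posts), since the on-disk labels are exactly what's in dispute here:

zpool get autoreplace <pool>       # confirm it really was off on both heads
zdb -l /dev/dsk/c0t0d0s0           # dump the vdev labels on the disputed disk;
                                   # a data vdev carries a full pool config in its label,
                                   # a cache device should not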
Re: [zfs-discuss] zpool list size
On 08/02/2010, at 22.50, Richard Elling wrote: r...@vmstor01:/# zpool list NAME SIZE ALLOC FREECAP DEDUP HEALTH ALTROOT dataPool 9.94G 4.89G 5.04G49% 1.00x ONLINE - Now here's what I don't get, why does it say the poo sizel is 9.94G when it's made up of 2 x raidz2 consisting of 1G volumes, it should only be 6G which df -h also reports correctly. No, zpool displays the available pool space. df -h displays something else entirely. If you have 10 1GB vdevs, then the total available pool space is 10GB. From the zpool(1m) man page: ... size Total size of the storage pool. These space usage properties report actual physical space available to the storage pool. The physical space can be different from the total amount of space that any contained datasets can actually use. The amount of space used in a raidz configuration depends on the characteristics of the data being written. In addition, ZFS reserves some space for internal accounting that the zfs(1M) command takes into account, but the zpool command does not. For non-full pools of a reasonable size, these effects should be invisible. For small pools, or pools that are close to being completely full, these discrepancies may become more noticeable. ... -- richard Ok thanks I know that the amount of used space will vary, but what's the usefulness of the total size when ie in my pool above 4 x 1G (roughly, depending on recordsize) are reserved for parity, it's not like it's useable for anything else :) I just don't see the point when it's a raidz or raidz2 pool, but I guess I am missing something here. Cheers, - Lasse ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS access by OSX clients
enh == Edward Ned Harvey macenterpr...@nedharvey.com writes: enh How are you managing UID's on the NFS server? All the macs are installed from the same image using asr. And for the most part, there's just one user, except where there isn't, and then I manage uid's by hand. enh When I did this in the past, I maintained a list of users in enh AD, and duplicate list of users in OD, so the mac clients enh could resolve names to UID's via OD. And a third duplicate enh list in NIS so the linux clients could resolve. It was enh terrible. Why is that terrible? Is it impossible to automate because of the AD piece? OD/NIS should be dumpable from SQL easily, right? If AD is the unscriptable piece, it just seems kind of sad to throw the whole thing out and standardize on the one piece that's the most convoluted and brittle and least automatable, instead of the other way around. enh How do you manage your NFS exports? [...] export to the whole enh subnet yeah, that. r...@1.2.3.0/24 there is a highly stupid bug that would crash mountd for NFSv4 or get incorrect refusal for NFSv3 if the IP was not lookupable in reverse DNS or /etc/hosts. but it may be fixed now because someone from nfs-discuss was unable to reproduce. enh I would consider it a security risk, if any schmo could take enh any unused IP address, connect to the server, and claim to be enh eharvey yeah there is zero security, none at all. I don't really think adding exports restrictions at a finer granularity than subnet would help much. Only Kerberos would help. but most of the security we care about comes from taking snapshots: that's the attack that's relevant here, disgruntled or confused employees deleting everything. This is a robust kind of security, not MM model. also every desktop has a read-only copy of yesterday's shared filesystem, from another nfs server populated with rsync, pre-mounted, in case of problems with the writeable one. At least it is not crap security like SMB, with five or ten wildly different variants and password formats operating on different ports some with MAC session-binding some without. I admit SMB has some security rather than none, but it's a slow crashy clumsy caveat-laden protocol. You might also look at it this way: if there's going to be a panic/DoS or exploitable buffer overflow security problem, it's far more likely to be in the SMB stack than the NFS stack. (that said, 'mknod file b 14 n' seems to panic a Solaris NFS server, at least b71.) enh solved by the launchd.conf edit. Presumably this umask enh applies, whether you create a folder in Finder, or create a enh file in MS Word, or save a new text file from TextEdit ... The enh umask is applied to every file and every folder creation, enh regardless of which app is doing the creation, right? right. This much works perfectly AFAICT. I suppose if you have a user database and want private user folders, you just make them owned by that user and chmod 700. At least that much works everywhere and survives backup, unlike this complete disaster that is ACL's. I get it, the NFSv3 featureset with no text usernames and no Kerberos unchanged in two decades is not a reasonable answer to modern expectations, and NIS is no longer the unifying directory service it once was now that Mac is a credible client. AD can go fuck itself: buy a windows server and another sysadmin to manage it, or suffer the polluting effect it has on your mind and your entire operation. but, yeah, NFSv3 is not enough. 
Its zero-security simplicity turns out to be exactly what we need here though, and the Mac NFS client with the automounter, 10.5 or later, is extremely solid - more so than the other Mac filesystems or GlobalSAN. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
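the server side is about this simple, too (subnet and dataset invented for the example):

zfs set sharenfs='rw=@192.168.10.0/24' tank/export/shared
share                              # or look at /etc/dfs/sharetab to see what's exported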
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
On Mon, Feb 08, 2010 at 11:24:56AM -0800, Lutz Schumann wrote: Only with the zdb(1M) tool but note that the checksums are NOT of files but of the ZFS blocks. Thanks - blocks, right (doh) - that's what I was missing. Damn, it would be so nice :( If you're comparing the current data to a snapshot baseline on the same pool, it just means you need to compare more checksums (several per file); it doesn't invalidate the idea. There may also be other ways of checking quickly that the file data is unmodified since snapshot X, but again it will require looking at zfs internals. This is far from the first use case for an official interface to get at this kind of data. It's quite similar to the question of how to verify send|recv integrity from yesterday, for example. As yet I don't know of a concrete proposal of what such an interface should look like (since there's nothing to borrow from POSIX), let alone an implementation. It's more complicated if you're comparing checksums against an external baseline reference (such as from a build) because block sizes and checksum algorithms may vary between pools. However, as you note, that's already catered for by existing tools, so they could work together. -- Dan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool list size
On Mon, Feb 08, 2010 at 11:28:11PM +0100, Lasse Osterild wrote: Ok thanks I know that the amount of used space will vary, but what's the usefulness of the total size when ie in my pool above 4 x 1G (roughly, depending on recordsize) are reserved for parity, it's not like it's useable for anything else :) I just don't see the point when it's a raidz or raidz2 pool, but I guess I am missing something here. The basis of raidz is that each block is its own raid stripe, with its own layout. At present, this only matters for the size of the stripe. For example, if I write a single 512-byte block, to a dual-parity raidz2, I will write three blocks, to three disks. With a larger block, I will have more data over more disks, until the block is big enough to stripe evenly over all of them. As the block gets bigger yet, more is written to each disk as part of the stripe, and the parity units get bigger to match the size of the largest data unit. This rounding can very often mean that different disks have different amounts of data for each stripe. Crucially, it also means the ratio of parity-to-data is not fixed. This tends to average out on a pool with lots of data and mixed block sizes, but not always; consider an extreme case of a pool containing only datasets with blocksize=512. That's what the comments in the documentation are referring to, and the major reason for the zpool output you see. In future, it may go further and be more important. Just as the data count per stripe can vary, there's nothing fundamental in the raidz layout that says that the same parity count and method has to be used for the entire pool, either. Raidz already degrades to simple mirroring in some of the same small-stripe cases discussed above. There's no particular reason, in theory, why they could not also have different amounts of parity on a per-block basis. I imagine that when bp-rewrite and the ability to reshape pools comes along, this will indeed be the case, at least during transition. As a simple example, when reshaping a raidz1 to a raidz2 by adding a disk, there will be blocks with single parity and other blocks with dual for a time until the operation is finished. Maybe one day in the future, there will just be a basic raidz vdev type, and we can set dataset properties for the number of additional parity blocks each should get. This might be a little like we can currently set copies, including that it would only affect new writes and lead to very mixed redundancy states. Noone has actually said this is a real goal, and the reasons it's not presently allowed include administrative and operational simplicity as well as implementation and testing constraints, but I think it would be handy and cool. -- Dan. pgpJKzcDxWcE8.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
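To put rough numbers on that for the 5-disk raidz2 vdevs in this thread (512-byte sectors, 3 data columns + 2 parity columns) - back-of-the-envelope only:

  128K block: 256 data sectors spread over 3 columns = 86 rows, so ~172 parity
              sectors; data is ~60% of what hits the disks, matching the nominal 3/5.
  512B block: 1 data sector + 2 parity sectors; data is only 1/3 of what's written.

So the data-to-parity ratio genuinely varies per block, which is why zpool list sticks to raw space and leaves the data/parity split undetermined.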
Re: [zfs-discuss] ZFS ZIL + L2ARC SSD Setup
Am 08.02.10 22:23, schrieb Bob Friesenhahn: On Mon, 8 Feb 2010, Richard Elling wrote: If there is insufficient controller bandwidth capacity, then the controller becomes the bottleneck. We don't tend to see this for HDDs, but SSDs can crush a controller and channel. It is definitely seen with older PCI hardware. Well to make things short: Using JBOD + ZFS Striped Mirrors vs. controller's RAID10 dropped the max. sequential read I/O from over 400 MByte/s to below 300 MByte/s. However random I/O and sequential writes seemed to perform equally well. One thing however was much better using ZFS mirrors: random seek performance was about 4 times higher, so I guess for random I/O on a busy system the JBOD would win. The controller can deliver 800 MByte/s on cache hits and is connected with PCIe x8, so theoretically it should have enough PCI bandwidth. Its CPU is the older 500MHz IOP333, so it has less power than the newer IOP348 controllers with 1.2GHz CPUs. Too bad I have no choice but to use HW RAID, because the mainboard bios only supports 7 boot devices, so it can't boot from the right disk if the Areca is in JBOD and I found no way to disable the controller's BIOS. Well maybe I could flash the EFI BIOS to work around this... (I've done my tests by reconfiguring the controller at runtime.) Bob - Felix ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool list size
Hi Richard, I last updated this FAQ on 1/19. Which part is not well-maintained? :-) Cindy On 02/08/10 14:50, Richard Elling wrote: This is a FAQ, but the FAQ is not well maintained :-( http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq On Feb 8, 2010, at 1:35 PM, Lasse Osterild wrote: Hi, This may well have been covered before but I've not been able to find an answer to this particular question. I've setup a raidz2 test env using files like this: # mkfile 1g t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 s1 s2 # zpool create dataPool raidz2 /xvm/t1 /xvm/t2 /xvm/t3 /xvm/t4 /xvm/t5 # zpool add dataPool raidz2 /xvm/t6 /xvm/t7 /xvm/t8 /xvm/t9 /xvm/t10 # zpool add dataPool spare /xvm/s1 /xvm/s2 # zpool status dataPool pool: dataPool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM dataPool ONLINE 0 0 0 raidz2-0 ONLINE 0 0 0 /xvm/t1 ONLINE 0 0 0 /xvm/t2 ONLINE 0 0 0 /xvm/t3 ONLINE 0 0 0 /xvm/t4 ONLINE 0 0 0 /xvm/t5 ONLINE 0 0 0 raidz2-1 ONLINE 0 0 0 /xvm/t6 ONLINE 0 0 0 /xvm/t7 ONLINE 0 0 0 /xvm/t8 ONLINE 0 0 0 /xvm/t9 ONLINE 0 0 0 /xvm/t10 ONLINE 0 0 0 spares /xvm/s1 AVAIL /xvm/s2 AVAIL All is good and it works, I then copied a few gigs of data onto the pool and checked with zpool list r...@vmstor01:/# zpool list NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT dataPool 9.94G 4.89G 5.04G 49% 1.00x ONLINE - Now here's what I don't get, why does it say the pool size is 9.94G when it's made up of 2 x raidz2 consisting of 1G volumes, it should only be 6G which df -h also reports correctly. No, zpool displays the available pool space. df -h displays something else entirely. If you have 10 1GB vdevs, then the total available pool space is 10GB. From the zpool(1m) man page: ... size Total size of the storage pool. These space usage properties report actual physical space available to the storage pool. The physical space can be different from the total amount of space that any contained datasets can actually use. The amount of space used in a raidz configuration depends on the characteristics of the data being written. In addition, ZFS reserves some space for internal accounting that the zfs(1M) command takes into account, but the zpool command does not. For non-full pools of a reasonable size, these effects should be invisible. For small pools, or pools that are close to being completely full, these discrepancies may become more noticeable. ... -- richard For a RAIDZ2 pool I find the information, the fact that it's 9.94G and not 5.9G, completely useless and misleading, why is parity part of the calculation? Also ALLOC seems wrong, there's nothing in the pool except a full copy of /usr (just to fill up with test data), it does however correctly display that I've used about 50% of the pool. This is a build 131 machine btw. r...@vmstor01:/# df -h /dataPool Filesystem Size Used Avail Use% Mounted on dataPool 5.9G 3.0G 3.0G 51% /dataPool Cheers, - Lasse ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
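As a quick cross-check of the two numbers in that exchange, scaling the raw size zpool reports by the data fraction of each raidz2 vdev lands close to what df shows. This ignores metadata and the small accounting reserve, so it is only a ballpark:

#!/bin/sh
# Reconcile 'zpool list' (raw space) with 'df -h' (usable space) for a pool
# built from two 5-disk raidz2 vdevs: 3 of every 5 disks hold data.
RAW_GB=9.94      # SIZE column from zpool list
DATA_DISKS=3
TOTAL_DISKS=5

awk -v raw=$RAW_GB -v d=$DATA_DISKS -v t=$TOTAL_DISKS 'BEGIN {
    printf "expected usable space: ~%.1fG (df -h reported 5.9G)\n", raw * d / t
}'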
Re: [zfs-discuss] zpool list size
On 09/02/2010, at 00.23, Daniel Carosone wrote: On Mon, Feb 08, 2010 at 11:28:11PM +0100, Lasse Osterild wrote: Ok thanks I know that the amount of used space will vary, but what's the usefulness of the total size when ie in my pool above 4 x 1G (roughly, depending on recordsize) are reserved for parity, it's not like it's useable for anything else :) I just don't see the point when it's a raidz or raidz2 pool, but I guess I am missing something here. The basis of raidz is that each block is its own raid stripe, with its own layout. At present, this only matters for the size of the stripe. For example, if I write a single 512-byte block, to a dual-parity raidz2, I will write three blocks, to three disks. With a larger block, I will have more data over more disks, until the block is big enough to stripe evenly over all of them. As the block gets bigger yet, more is written to each disk as part of the stripe, and the parity units get bigger to match the size of the largest data unit. This rounding can very often mean that different disks have different amounts of data for each stripe. Crucially, it also means the ratio of parity-to-data is not fixed. This tends to average out on a pool with lots of data and mixed block sizes, but not always; consider an extreme case of a pool containing only datasets with blocksize=512. That's what the comments in the documentation are referring to, and the major reason for the zpool output you see. In future, it may go further and be more important. Just as the data count per stripe can vary, there's nothing fundamental in the raidz layout that says that the same parity count and method has to be used for the entire pool, either. Raidz already degrades to simple mirroring in some of the same small-stripe cases discussed above. There's no particular reason, in theory, why they could not also have different amounts of parity on a per-block basis. I imagine that when bp-rewrite and the ability to reshape pools comes along, this will indeed be the case, at least during transition. As a simple example, when reshaping a raidz1 to a raidz2 by adding a disk, there will be blocks with single parity and other blocks with dual for a time until the operation is finished. Maybe one day in the future, there will just be a basic raidz vdev type, and we can set dataset properties for the number of additional parity blocks each should get. This might be a little like we can currently set copies, including that it would only affect new writes and lead to very mixed redundancy states. Noone has actually said this is a real goal, and the reasons it's not presently allowed include administrative and operational simplicity as well as implementation and testing constraints, but I think it would be handy and cool. -- Dan. Thanks Dan! :) That explanation made perfect sense and I appreciate you taking the time to write this, perhaps parts of it could go into the FAQ ? I realise that it's sort of in there already but it doesn't explain it very well. Cheers, - Lasse ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool list size
Hi Lasse, I expanded this entry to include more details of the zpool list and zfs list reporting. See if the new explanation provides enough details. Thanks, Cindy On 02/08/10 16:51, Lasse Osterild wrote: On 09/02/2010, at 00.23, Daniel Carosone wrote: On Mon, Feb 08, 2010 at 11:28:11PM +0100, Lasse Osterild wrote: Ok thanks I know that the amount of used space will vary, but what's the usefulness of the total size when ie in my pool above 4 x 1G (roughly, depending on recordsize) are reserved for parity, it's not like it's useable for anything else :) I just don't see the point when it's a raidz or raidz2 pool, but I guess I am missing something here. The basis of raidz is that each block is its own raid stripe, with its own layout. At present, this only matters for the size of the stripe. For example, if I write a single 512-byte block, to a dual-parity raidz2, I will write three blocks, to three disks. With a larger block, I will have more data over more disks, until the block is big enough to stripe evenly over all of them. As the block gets bigger yet, more is written to each disk as part of the stripe, and the parity units get bigger to match the size of the largest data unit. This rounding can very often mean that different disks have different amounts of data for each stripe. Crucially, it also means the ratio of parity-to-data is not fixed. This tends to average out on a pool with lots of data and mixed block sizes, but not always; consider an extreme case of a pool containing only datasets with blocksize=512. That's what the comments in the documentation are referring to, and the major reason for the zpool output you see. In future, it may go further and be more important. Just as the data count per stripe can vary, there's nothing fundamental in the raidz layout that says that the same parity count and method has to be used for the entire pool, either. Raidz already degrades to simple mirroring in some of the same small-stripe cases discussed above. There's no particular reason, in theory, why they could not also have different amounts of parity on a per-block basis. I imagine that when bp-rewrite and the ability to reshape pools comes along, this will indeed be the case, at least during transition. As a simple example, when reshaping a raidz1 to a raidz2 by adding a disk, there will be blocks with single parity and other blocks with dual for a time until the operation is finished. Maybe one day in the future, there will just be a basic raidz vdev type, and we can set dataset properties for the number of additional parity blocks each should get. This might be a little like we can currently set copies, including that it would only affect new writes and lead to very mixed redundancy states. Noone has actually said this is a real goal, and the reasons it's not presently allowed include administrative and operational simplicity as well as implementation and testing constraints, but I think it would be handy and cool. -- Dan. Thanks Dan! :) That explanation made perfect sense and I appreciate you taking the time to write this, perhaps parts of it could go into the FAQ ? I realise that it's sort of in there already but it doesn't explain it very well. Cheers, - Lasse ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zpool/zfs history does not record version upgrade events
zpool/zfs history does not record version upgrade events; those seem like important events worth keeping in either the public or internal history. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
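A quick way to see for yourself what does and does not get logged is to compare the history before and after an upgrade; the pool name below is just a placeholder, and note that zpool upgrade permanently changes the on-disk version:

# Command history as logged for a pool, then the internal events too:
zpool history tank
zpool history -il tank       # -i internal events, -l long format

# Upgrade the pool and filesystem versions, then re-check the history
# to confirm whether either upgrade was recorded:
zpool upgrade tank
zfs upgrade tank
zpool history -il tank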
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
This is a long thread, with lots of interesting and valid observations about the organisation of the industry, the segmentation of the market, getting what you pay for vs paying for what you want, etc. I don't really find within, however, an answer to the original question, at least the way I read it. Perhaps that's the issue - that the question was asked without enough specifics and context, and so everyone has their own interpretation and their own answer to their own question. Remembering that a lot of this was branded and marketed as open storage, the desire to mix and match components is not only natural, a clear expectation has been set that it should be possible and easy and open. That's not to say that you can expect to have your cake and eat it too. Certain combinations and permutations are more qualified, tested, supported and therefore expensive than others; these characteristics are part of what you should be able to mix and match, understanding the full implications of each tradeoff choice. Snorcle wants to sell hardware. Sure, they want even more to sell a complete hardware and annual maintenance package with annuity revenue over multiple years with high markups. Some people are simply not customers for all of that, but might still be customers for the hardware. Especially these days, it seems they still would want to sell the hardware even when they can't sell the rest of the package. I read the following context between the lines of the original question: - I have or can source disk drives I'm comfortable using. - I understand that I'm not paying for, and can't expect, commercial support for whatever final combination I wind up with. - I am comfortable relying on standards and specifications for interoperability, enough that it's unlikely I'll have to get into deep debugging for problems. At least, I'm unwilling or unable to pay high premiums ahead of time in the hope of avoiding potential high costs for later problems. - The J4500 seems like nice hardware, and I know that at least it isn't likely to change unexpectedly to some different chipset not recognised by opensolaris, just before purchase. This would give me some comfort. - I like Sun, and am thankful for ZFS, and since I have to buy hardware anyway I'll look at what Sun offers. Perhaps I would even prefer to buy the Sun offering, all else being approximately equal. This would also give me some comfort. In that context, I haven't seen an answer, just a conclusion: - All else is not equal, so I give my money to some other hardware manufacturer, and get frustrated that Sun won't let me buy the parts I could use effectively and comfortably. -- Dan. pgpvZdqvo577p.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Dedup Questions.
Hi, I am loving the new dedup feature. A few questions: If you enable it after data is on the filesystem, it will find the dupes on read as well as write? Would a scrub therefore make sure the DDT is fully populated? Re the DDT, can someone outline its structure please? Some sort of hash table? The blogs I have read so far don't specify. Re DDT size, is (data in use)/(av blocksize) * 256bit right as a worst case (i.e. all blocks non-identical)? What are average block sizes? Cheers, Tom ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool list size
On Mon, Feb 08, 2010 at 05:23:29PM -0700, Cindy Swearingen wrote: Hi Lasse, I expanded this entry to include more details of the zpool list and zfs list reporting. See if the new explanation provides enough details. Cindy, feel free to crib from or refer to my text in whatever way might help. -- Dan. pgp25J93QupLp.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ZIL + L2ARC SSD Setup
On Tue, 9 Feb 2010, Felix Buenemann wrote: Well to make things short: Using JBOD + ZFS Striped Mirrors vs. controller's RAID10 dropped the max. sequential read I/O from over 400 MByte/s to below 300 MByte/s. However random I/O and sequential writes seemed to perform Much of the difference is likely that your controller implements true RAID10 whereas ZFS striped mirrors are actually load-shared mirrors. Since zfs does not use true striping across vdevs, it relies on sequential prefetch requests to get the sequential read rate up. Sometimes zfs's prefetch is not aggressive enough. I have observed that there may still be considerably more read performance available (to another program/thread) even while a benchmark program is reading sequentially as fast as it can. Try running two copies of your benchmark program at once and see what happens. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
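A crude way to run that experiment, assuming a couple of large, uncached files already sit on the pool (paths are placeholders):

#!/bin/sh
# Start two sequential readers at once and watch aggregate throughput.
ptime dd if=/tank/bench/file1 of=/dev/null bs=1024k &
ptime dd if=/tank/bench/file2 of=/dev/null bs=1024k &

# Meanwhile, in another terminal, watch per-disk and pool-wide bandwidth:
#   iostat -xn 5
#   zpool iostat -v tank 5
wait

If the combined throughput of the two readers clearly exceeds what a single reader achieved, prefetch rather than the controller is the limiting factor.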
[zfs-discuss] Anyone with experience with a PCI-X SSD card?
I've a couple of older systems that are front-ending a large backup array. I'd like to put in a large L2ARC cache device for them to use with dedup. Right now, they only have Ultra320 SCA 3.5" hot-swap drive bays, and PCI-X slots. I haven't found any SSDs (or adapters) which might work with the Ultra320 bays, so I'm hunting for something to stick in the PCI-X (NOT PCI-Express) slot. Ideally, I'd love to find something that lets me hook a standard 2.5" SSD to, but I have space limitations. About the best I've found right now is a 32-bit PCI card which has Compact Flash slots on it. /Really/ not what I want. So, I've seen a bunch of PCI-E cards which have flash on them and act as an SSD, but is there any hope for an old PCI-X slot? Anyone seen such a beast? -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
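Whatever device ends up filling that slot, attaching it as L2ARC is the easy part; for example (pool and device names are placeholders):

# Add an SSD to an existing pool as an L2ARC cache device and watch it warm up.
zpool add backuppool cache c2t1d0
zpool iostat -v backuppool 5    # the cache device appears with its own stats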
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
Daniel Carosone d...@geek.com.au writes: In that context, I haven't seen an answer, just a conclusion: - All else is not equal, so I give my money to some other hardware manufacturer, and get frustrated that Sun won't let me buy the parts I could use effectively and comfortably. no one is selling disk brackets without disks. not Dell, not EMC, not NetApp, not IBM, not HP, not Fujitsu, ... -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
Just like I said way earlier, the entire idea is like asking to buy a Ferrari without the aluminum wheels they sell because you think they are charging too much for them; after all, aluminum is cheap. It's just not done that way. There are OTHER OPTIONS for people who can't afford it. You really can't have both. You can either afford it or you can't. On Mon, Feb 8, 2010 at 8:36 PM, Kjetil Torgrim Homme kjeti...@linpro.no wrote: Daniel Carosone d...@geek.com.au writes: In that context, I haven't seen an answer, just a conclusion: - All else is not equal, so I give my money to some other hardware manufacturer, and get frustrated that Sun won't let me buy the parts I could use effectively and comfortably. no one is selling disk brackets without disks. not Dell, not EMC, not NetApp, not IBM, not HP, not Fujitsu, ... -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
On Monday, February 8, 2010, Kjetil Torgrim Homme kjeti...@linpro.no wrote: Daniel Carosone d...@geek.com.au writes: In that context, I haven't seen an answer, just a conclusion: - All else is not equal, so I give my money to some other hardware manufacturer, and get frustrated that Sun won't let me buy the parts I could use effectively and comfortably. no one is selling disk brackets without disks. not Dell, not EMC, not NetApp, not IBM, not HP, not Fujitsu, ... -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Although I am in full support of what sun is doing, to play devils advocate: supermicro is. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
On Mon, Feb 8, 2010 at 9:13 PM, Tim Cook t...@cook.ms wrote: On Monday, February 8, 2010, Kjetil Torgrim Homme kjeti...@linpro.no wrote: Daniel Carosone d...@geek.com.au writes: In that context, I haven't seen an answer, just a conclusion: - All else is not equal, so I give my money to some other hardware manufacturer, and get frustrated that Sun won't let me buy the parts I could use effectively and comfortably. no one is selling disk brackets without disks. not Dell, not EMC, not NetApp, not IBM, not HP, not Fujitsu, ... -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Although I am in full support of what sun is doing, to play devils advocate: supermicro is. This is a far cry from an apples to apples comparison though. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
Tim Cook wrote: On Monday, February 8, 2010, Kjetil Torgrim Homme kjeti...@linpro.no wrote: Daniel Carosone d...@geek.com.au writes: In that context, I haven't seen an answer, just a conclusion: - All else is not equal, so I give my money to some other hardware manufacturer, and get frustrated that Sun won't let me buy the parts I could use effectively and comfortably. no one is selling disk brackets without disks. not Dell, not EMC, not NetApp, not IBM, not HP, not Fujitsu, .. Although I am in full support of what sun is doing, to play devils advocate: supermicro is. True, but they're not a systems vendor. They're a parts OEM. You might be able to get larger integrated solutions from them (motherboard/chassis together), but you'll have to buy the rest of the parts yourself (or go to a system integrator to build a system for you). No brand-name system provider allows you to purchase empty disk sleds. About the best I can come up with on that is that eBay often has a selection of various brackets, usually from 3rd-parties which copy the Brand design. In the end, you pay for support and integration testing. Whether it is worth it depends solely on your situation. But don't expect vendors to service all (or even many) niches - they all pick their battles, and if you're not in their zone, it's a huge uphill struggle to get them to add your zone. It's that simple. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
Maybe look at rsync and rsync lib (http://librsync.sourceforge.net/) code to see if a ZFS API could be designed to help rsync/librsync in the future as well as diff. It might be a good idea for POSIX to have a single checksum and a multi-checksum interface. One problem could be block sizes: if a file is re-written and is the same size, it may have different ZFS record sizes within if it was written over a long period of time (txgs) (ignoring compression), and therefore you could not use ZFS checksums to compare two files. Side Note: It would be nice if ZFS on every txg only wrote full record sizes unless it was short on memory, or a file was closed. Maybe the txg could happen more often if it just scanned for full-recordsize writes and closed files. Or blocks which had not been altered for three scans. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] [OT] excess zfs-discuss mailman digests
Hi. As sometimes list-owners aren't monitored... I signed up for digests. On the mailman page it hints at once-daily service. I'm getting maybe 12 per day, didn't count them. Non-overlapping, various message counts in each. This is unexpected given the above hint. Once a day would be nice :) Thanks. PS: Is there any way to get a copy of the list since inception for local client perusal, not via some online web interface? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Big send/receive hangs on 2009.06
Nobody has any ideas? It's still hung after work. I wonder what it will take to stop the backup and export the pool? Well, that's nice; a straight kill terminated the processes, at least. zpool status shows no errors. zfs list shows backup filesystems mounted. zpool export -f is running...no disk I/O now...starting to look hung. Ah, the zfs receive process is still in the process table. kill -9 doesn't help. Kill and kill -9 won't touch the zpool export process, either. Pulling the USB cable on the drive doesn't seem to be helping any either. zfs list now hangs, but giving it a little longer just in case. Kill -9 doesn't touch any of the hung jobs. Closing the ssh sessions doesn't touch any of them either. zfs list on pools other than bup-wrack works. zpool list works, and shows bup-wrack. Attempting to set failmode=continue gives an I/O error. Plugging the USB back in and then setting failmode gives the same I/O error. cfgadm -al lists known disk drives and usb3/9 as usb-storage connected. I think that's the USB disk that's stuck. cfgadm -cremove usb3/9 failed configuration operation not supported. cfgadm -cdisconnect usb3/9 queried if I wanted to suspend activity, then failed with cannot issue devctl to ap_id: /devices/p...@0,0/pci10de,c...@2,1:9 Still -al the same. cfgadm -cunconfigure same error as disconnect. I was able to list properties on bup-wrack: bash-3.2$ zpool get all bup-wrack NAME PROPERTY VALUE SOURCE bup-wrack size 928G - bup-wrack used 438G - bup-wrack available 490G - bup-wrack capacity 47% - bup-wrack altroot /backups/bup-wrack local bup-wrack health UNAVAIL - bup-wrack guid 2209605264342513453 default bup-wrack version 14 default bup-wrack bootfs - default bup-wrack delegation on default bup-wrack autoreplace off default bup-wrack cachefile none local bup-wrack failmode wait default bup-wrack listsnapshots off default It's not healthy, alright. And the attempt to set failmode really did fail. I've been here before, and it has always required a reboot. Other than setting failmode=continue earlier, anybody have any ideas? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [OT] excess zfs-discuss mailman digests
On Mon, Feb 8, 2010 at 9:04 PM, grarpamp grarp...@gmail.com wrote: PS: Is there any way to get a copy of the list since inception for local client perusal, not via some online web interface? You can get monthly .gz archives in mbox format from http://mail.opensolaris.org/pipermail/zfs-discuss/. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
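If wget is available, pulling the monthly mbox archives down for a local mail client is a short loop. The URL pattern and month naming below follow the usual pipermail layout, so adjust if the archive differs:

#!/bin/sh
# Fetch the zfs-discuss monthly archives and concatenate them into one mbox.
BASE=http://mail.opensolaris.org/pipermail/zfs-discuss
OUT=zfs-discuss.mbox
: > "$OUT"
for y in 2006 2007 2008 2009 2010; do
    for m in January February March April May June July August \
             September October November December; do
        f="${y}-${m}.txt.gz"
        wget -q "$BASE/$f" || continue   # skip months with no archive
        gunzip -c "$f" >> "$OUT"
    done
done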
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
Although I am in full support of what sun is doing, to play devils advocate: supermicro is. They're not the only ones, although the most-often discussed here. Dell will generally sell hardware and warranty and service add-ons in any combination, to anyone willing and capable of figuring out what to order, although that effort might well be more than the result is worth. Many of the others have issues in being further from the retail market, such as support divisions that are only set up to deal with large enterprise full-service customers. Nothing wrong with that if it suits them. Of the others listed, Sun is the one promoting change and the benefits of ZFS and open storage, and which has the opportunity to make sales to an interested community. They, too, are entitled to exclude themselves from sales they don't want, for whatever reason they or their new masters choose. On Mon, Feb 08, 2010 at 09:33:12PM -0500, Thomas Burgess wrote: This is a far cry from an apples to apples comparison though. As much as I'm no fan of Apple, it's a pity they dropped ZFS because that would have brought considerable attention to the opportunity of marketing and offering zfs-suitable hardware to the consumer arena. Port-multiplier boxes already seem to be targeted most at the Apple crowd, even if it's only in hope of scoring a better margin. Otherwise, bad analogies, whether about cars or fruit, don't help. -- Dan. pgpFh9oakiUNu.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
Damon Atkins damon_atk...@yahoo.com.au writes: One problem could be block sizes, if a file is re-written and is the same size it may have different ZFS record sizes within, if it was written over a long period of time (txg's)(ignoring compression), and therefore you could not use ZFS checksum to compare two files. the record size used for a file is chosen when that file is created. it can't change. when the default record size for the dataset changes, only new files will be affected. ZFS *must* write a complete record even if you change just one byte (unless it's the tail record of course), since there isn't any better granularity for the block pointers. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
On Mon, Feb 08, 2010 at 09:33:12PM -0500, Thomas Burgess wrote: This is a far cry from an apples to apples comparison though. As much as I'm no fan of Apple, it's a pity they dropped ZFS because that would have brought considerable attention to the opportunity of marketing and offering zfs-suitable hardware to the consumer arena. Port-multiplier boxes already seem to be targeted most at the Apple crowd, even if it's only in hope of scoring a better margin. Otherwise, bad analogies, whether about cars or fruit, don't help. It might help people to understand how ridiculous they sound going on and on about buying a premium storage appliance without any storage. I think the car analogy was dead on. You don't have to agree with a vendor's practices to understand them. If you have a more fitting analogy, then by all means let's hear it. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [OT] excess zfs-discuss mailman digests
grarpamp grarp...@gmail.com writes: PS: Is there any way to get a copy of the list since inception for local client perusal, not via some online web interface? I prefer to read mailing lists using a newsreader and the NNTP interface at Gmane. a newsreader tends to be better at threading etc. than a mail client which is fed an mbox... see http://gmane.org/about.php for more information. -- Kjetil T. Homme Redpill Linpro AS - Changing the game ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Anyone with experience with a PCI-X SSD card?
On Mon, Feb 8, 2010 at 10:33 PM, Erik Trimble erik.trim...@sun.com wrote: Erik Trimble wrote: I've a couple of older systems that are front-ending a large backup array. I'd like to put in a large L2ARC cache device for them to use with dedup. Right now, they only have Ultra320 SCA 3.5 hot-swap drive bays, and PCI-X slots. I haven't found any SSDs (or adapters) which might work with the Ultra320 bays, so I'm hunting for something to stick in the PCI-X (NOT PCI-Express) slot. Ideally, I'd love to find something that lets me hook a standard 2.5 SSD to, but I have space limitations. About the best I've found right now is a 32-bit PCI card which has Compact Flash slots on it. /Really/ not what I want. So, I've seen a bunch of PCI-E cards which have flash on them and act as a SSD, but is there any hope for a old PCI-X slot? Anyone seen such a beast? To reply to myself, the best I can do is this: http://www.apricorn.com/product_detail.php?type=familyid=59 (it uses a sil3124 controller, so it /might/ work with OpenSolaris ) and an award for the That-is-ALMOST-What-I-Want goes to: http://www.sonnettech.com/PRODUCT/tempohd.html The first one is really cool. I was wondering this as well btw. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Anyone with experience with a PCI-X SSD card?
On Mon, Feb 08, 2010 at 07:33:56PM -0800, Erik Trimble wrote: To reply to myself, the best I can do is this: http://www.apricorn.com/product_detail.php?type=familyid=59 (it uses a sil3124 controller, so it /might/ work with OpenSolaris ) Nice. I'd certainly like to know if you try it and have success. Note that the pci-x version also has a pci-e to pci-x bridge (Tsi384) that would need to work. I expect ppb's are handled generically by the framework and spec. I didn't find anything to indicate either way whether there was bootable bios on board; again this might be a potential hurdle with the ppb if you intend to boot from it. For me, I'd be looking at the pci-e version, and as you note there are other options for pure ssd. This seems the most modular (choosing my own brand/type/mix of ssd and hdd, for example). -- Dan. pgpqVyu2yI9G7.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Anyone with experience with a PCI-X SSD card?
On Tue, Feb 09, 2010 at 03:11:38PM +1100, Daniel Carosone wrote: I didn't find anything to indicate either way whether there was bootable bios on board Ah - in the install guide there's a mention about pressing F4 or Ctrl-S when prompted at boot to configure the raid format, so there evidently is some bios. again this might be a potential hurdle with the ppb if you intend to boot from it. pgp69OTd5RWBA.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
I would have thought that if I write 1k and then the ZFS txg times out in 30 secs, the 1k will be written to disk in a 1k record block; then if I write 4k, 30 secs later when the next txg happens another 4k record block will be written; and then if I write 130k, a 128k and a 2k record block will be written. Making the file have record sizes of 1k+4k+128k+2k. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup Questions.
On Feb 8, 2010, at 6:04 PM, Kjetil Torgrim Homme wrote: Tom Hall thattommyh...@gmail.com writes: If you enable it after data is on the filesystem, it will find the dupes on read as well as write? Would a scrub therefore make sure the DDT is fully populated? no. only written data is added to the DDT, so you need to copy the data somehow. zfs send/recv is the most convenient, but you could even do a loop of commands like cp -p $file $file.tmp mv $file.tmp $file Re the DDT, can someone outline its structure please? Some sort of hash table? The blogs I have read so far don't specify. I can't help here. UTSL Re DDT size, is (data in use)/(av blocksize) * 256bit right as a worst case (i.e. all blocks non-identical)? the size of an entry is much larger: | From: Mertol Ozyoney mertol.ozyo...@sun.com | Subject: Re: Dedup memory overhead | Message-ID: 00cb01caa580$a3d6f110$eb84d330$%ozyo...@sun.com | Date: Thu, 04 Feb 2010 11:58:44 +0200 | | Approximately it's 150 bytes per individual block. What are average block sizes? as a start, look at your own data. divide the used size in df with used inodes in df -i. example from my home directory: $ /usr/gnu/bin/df -i ~ Filesystem Inodes IUsed IFree IUse% Mounted on tank/home 223349423 3412777 219936646 2% /volumes/home $ df -k ~ Filesystem kbytes used avail capacity Mounted on tank/home 573898752 257644703 109968254 71% /volumes/home so the average file size is 75 KiB, smaller than the recordsize of 128 KiB. extrapolating to a full filesystem, we'd get 4.9M files. unfortunately, it's more complicated than that, since a file can consist of many records even if the *average* is smaller than a single record. a pessimistic estimate, then, is one record for each of those 4.9M files, plus one record for each 128 KiB of diskspace (2.8M), for a total of 7.7M records. the size of the DDT for this (quite small!) filesystem would be something like 1.2 GB. perhaps a reasonable rule of thumb is 1 GB DDT per TB of storage. zdb -D poolname will provide details on the DDT size. FWIW, I have a pool with 52M DDT entries and the DDT is around 26GB. $ pfexec zdb -D tank DDT-sha256-zap-duplicate: 19725 entries, size 270 on disk, 153 in core DDT-sha256-zap-unique: 52284055 entries, size 284 on disk, 159 in core dedup = 1.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.00 (you can tell by the stats that I'm not expecting much dedup :-) -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
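The same arithmetic as a reusable snippet, using the roughly 150 bytes per entry quoted above; both the block-count estimate and the entry size are rough, so treat the result as an order-of-magnitude figure only:

#!/bin/sh
# Back-of-the-envelope DDT size estimate for a filesystem: one entry per file
# plus one per 128 KiB of used space, at ~150 bytes per entry.
FS=${1:-/volumes/home}

used_kb=$(df -k "$FS" | awk 'NR==2 {print $3}')
files=$(/usr/gnu/bin/df -i "$FS" | awk 'NR==2 {print $3}')

awk -v kb=$used_kb -v files=$files 'BEGIN {
    entries = files + kb / 128          # pessimistic record count
    bytes   = entries * 150             # ~150 bytes per DDT entry
    printf "est. DDT entries: %.1fM, est. DDT size: %.1f GiB\n",
           entries / 1e6, bytes / (1024 * 1024 * 1024)
}'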
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
On Feb 8, 2010, at 9:10 PM, Damon Atkins wrote: I would have thought that if I write 1k then ZFS txg times out in 30secs, then the 1k will be written to disk in a 1k record block, and then if I write 4k then 30secs latter txg happen another 4k record size block will be written, and then if I write 130k a 128k and 2k record block will be written. Making the file have record sizes of 1k+4k+128k+2k Close. Once the max record size is achieved, it is not reduced. So the allocation is: 1KB + 4KB + 128KB + 128KB Physical writes tend to be coalesced, which is one reason why a transactional system performs well. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
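One way to watch that happen is to grow a file in stages and inspect its block layout between syncs; the dataset and mountpoint below are placeholders, and zdb's output is a debugging interface whose exact fields may vary by build:

#!/bin/sh
# Append 1k, 4k and 130k to a file, syncing between writes, and dump the
# file's block sizes after each step.
DS=tank/test
F=/tank/test/growme

rm -f "$F"
for chunk in 1k 4k 130k; do
    dd if=/dev/urandom bs="$chunk" count=1 2>/dev/null >> "$F"
    sync; sleep 5           # give the txg a chance to commit
    obj=$(ls -i "$F" | awk '{print $1}')
    echo "=== after appending $chunk ==="
    # dblk is the file's current block size; L0 lines list each data block.
    zdb -ddddd "$DS" "$obj" | egrep 'dblk|L0'
done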