Re: [zfs-discuss] zpool split problem?
You mean /usr/sbin/sys-unconfig? No, it does not reset a system back far enough. You are still left with the original path_to_inst and the device tree. E.g. take a disk to a different system and the first disk might end up being sd10 and c15t0d0s0 instead of sd0 and c0, without cleaning up the system first, i.e. removing /etc/path_to_inst and most of what is in the device tree. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool split problem?
Why do we still need the /etc/zfs/zpool.cache file? (I could understand it being useful when zfs import was slow.) zpool import is now multi-threaded (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191), hence a lot faster, and each disk contains the hostname (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6282725); if a pool contains the same hostname as the server, then import it. I.e. with a multi-threaded zpool import, this bug should not be a problem any more: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6737296 HA Storage should be changed to just do a zpool import -h mypool instead of using a private zpool.cache file (-h meaning ignore whether the pool was last imported by a different host; and maybe a noautoimport property is needed on a zpool so clustering software can decide to import it by hand, as it does now). And then this zpool split problem would be fixed.
Re: [zfs-discuss] zpool split problem?
I assume the swap, dumpadm, and grub issues are because the pool has a different name now, but is it still a problem if you take it to a *different system*, boot off a CD, and change the name back to rpool? (Which is most likely unsupported, i.e. no help to get it working.) Over 10 years ago (way before flash archive existed) I developed a script, used after splitting a mirror, which would remove most of the device tree, clean up path_to_inst, etc. so it looked like the OS was just installed and about to do the reboot without the install CD. (Everything was still in there except for hardware-specific stuff. I no longer have the script and most likely would not do it again, because it is not a supported install method.) I still had to boot from CD on the new system and create the dev tree before booting off the disk for the first time, and then fix vfstab (but the vfstab fix should be gone with a zfs rpool). It would be nice for Oracle/Sun to produce a separate script which resets system/devices back to an install-like beginning, so you could move an OS disk with the current password file and software from one system to another and have it rebuild the device tree on the new system. From memory (updated for zfs), something like: zpool split rpool newrpool; mount newrpool; remove from newrpool/dev and newrpool/devices all non-packaged content (i.e. dynamically created content); clean up newrpool/etc/path_to_inst; create /newrpool/reconfigure; remove all previous snapshots in newrpool; update beadm info inside newrpool; ensure grub is installed on the disk.
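The steps listed above could be sketched roughly as a script. This is only an illustrative outline of the procedure described in the post, not a supported method; the pool name newrpool and the alternate-root path are taken from the post, and the commands would need real testing before use:

```shell
# Hypothetical outline of the split-and-reset procedure described above.
# NOT a supported install method; names and paths are illustrative only.
zpool split rpool newrpool              # detach one side of the mirror as a new pool
zpool import -R /newrpool newrpool      # mount it under an alternate root
# Remove dynamically created device nodes so they are rebuilt on first boot:
rm -rf /newrpool/dev/* /newrpool/devices/*
# Reset the instance numbers; the kernel recreates this on a reconfigure boot:
cp /dev/null /newrpool/etc/path_to_inst
touch /newrpool/reconfigure             # force a reconfiguration boot
# Remove all previous snapshots in the new pool:
zfs list -H -t snapshot -o name -r newrpool | xargs -n1 zfs destroy
zpool export newrpool
```

Updating the beadm information and installing grub on the new disk would still have to be done by hand, as the post says.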
[zfs-discuss] zfs send and ARC
In the Thoughts on ZFS Pool Backup Strategies thread it was stated that zfs send sends uncompressed data and uses the ARC. If zfs send sends uncompressed data which has already been compressed, this is not very efficient, and it would be *nice* to see it send the original compressed data (or have an option to do it). I thought I would ask a true-or-false type question, mainly for curiosity's sake. If zfs send uses the standard ARC cache (when something is not already in the ARC), I would expect this to hurt (to some degree?) the performance of the system. (I.e. I assume it has the effect of replacing current/useful data in the cache with not very useful/old data, depending on how large the zfs send is.) If the above is true, zfs send and a "zfs backup" (if that command existed, to back up and restore a file or set of files with all ZFS attributes) would improve the performance of normal reads/writes by avoiding the ARC cache (or, if easier to implement, by having their own private ARC cache). Or do they use the same sort of code as setting "primarycache=none" on a file system? Has anyone monitored ARC hit rates while doing a large zfs send? Cheers
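On Solaris the ARC counters are exposed through kstat, so the hit-rate question is easy to check by sampling before and after a large zfs send. A rough sketch; the kstat names are the standard zfs:0:arcstats ones, and the sample numbers below are made up purely to show the arithmetic:

```shell
# On Solaris you would sample the real counters with, e.g.:
#   hits=`kstat -p zfs:0:arcstats:hits | awk '{print $2}'`
#   misses=`kstat -p zfs:0:arcstats:misses | awk '{print $2}'`
# Take one sample before the send and one after, then compute the
# hit rate over the interval. Example with made-up numbers:
hits_before=1000000; misses_before=50000
hits_after=1600000;  misses_after=450000
dh=$((hits_after - hits_before))      # hits during the send
dm=$((misses_after - misses_before))  # misses during the send
rate=$((100 * dh / (dh + dm)))
echo "ARC hit rate during send: ${rate}%"
```

A hit rate that drops sharply during the send, compared with the steady state, would support the theory that the send stream is displacing useful data from the cache.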
Re: [zfs-discuss] ZFS file system confusion
NFSv4 has a concept of a root of the overall exported file system (pseudo-filesystem), FileHandle 0; in Linux terms it is set with fsid=0 when exporting. Which would explain why someone said Linux (NFSv4) automounts an exported filesystem under another exported filesystem, i.e. you can mount servername:/ and browse all exported/shared file systems you have access to. I don't think this made it into the Solaris NFSv4 server.
Re: [zfs-discuss] CR 6880994 and pkg fix
You could try copying the file to /tmp (i.e. swap/RAM) and doing a continuous loop of checksums, e.g.

cp libdlpi.so.1 /tmp ; cd /tmp
GOOD=`sha512sum -b libdlpi.so.1 | awk '{print $1}'`
while true ; do
  sleep 1
  cp libdlpi.so.1 libdlpi.so.1.x
  A=`sha512sum -b libdlpi.so.1.x | awk '{print $1}'`
  [ "$A" != "$GOOD" ] && break
  rm libdlpi.so.1.x
done ; date

Assuming the file never goes to swap, this would tell you if something on the motherboard is playing up. I have seen a CPU randomly set a byte to 0 which should not be 0; I think it was an L1 or L2 cache problem.
Re: [zfs-discuss] CR 6880994 and pkg fix
You could also use psradm to take a CPU off-line. At boot I would ??assume?? the system boots the same way every time unless something changes, so you could be hitting the same CPU core every time, or the same bit of RAM, until the system is fully booted. Or even run SunVTS (Validation Test Suite), which I believe has a test similar to the cp loop in /tmp, plus all the other tests it has.
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
A system with 100TB of data is 80% full, and a user asks their local system admin to restore a directory with large files, as it was 30 days ago, with all Windows/CIFS ACLs and NFSv4 ACLs etc. If we used zfs send, we would need to go back to a zfs send from some 30 days ago, and find 80TB of disk space to be able to restore it. zfs send/recv is great for copying one zfs file system to another, even across servers. But there needs to be a tool:
* To restore an individual file or a zvol (with all ACLs/properties)
* That allows backup vendors (which place backups on tape or disk or CD or ...) to build indexes of what is contained in the backup (e.g. filename, owner, size, modification dates, type (dir/file/etc.))
* That streams output suitable for devices like tape drives
* That can tell if a file is corrupted when being restored
* That may support recovery of corrupt data blocks within the stream
* That is preferably gnutar command-line compatible
* That admins can use to back up and transfer a subset of files, e.g. a user home directory (which is not a file system), to another server or onto CD to be sent to their new office location
Or, for backup vendors, is the idea for them to use the NDMP protocol to back up ZFS and all its properties/ACLs? Or is a new tool required to achieve the above? Cheers
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
I vote for zfs needing backup and restore commands that work against a snapshot. The backup command should output on stderr at least Full_Filename SizeBytes Modification_Date_1970secSigned, so backup software can build indexes, while stdout carries the data. The advantage of zfs providing the command is that as ZFS is upgraded or new features are added, backup vendors do not need to re-test their code. It could also mean that when encryption comes along, a property on the pool could indicate whether it is OK to decrypt only the filenames as part of a backup. restore would work the same way, except you would pass a filename or a directory to restore, etc., and the backup software would send the stream back to the zfs restore command. The other alternative is for zfs to provide a standard API for backups, like Oracle does for RMAN. It would be very useful combined with snapshots across pools: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6916404
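To make the proposed stderr index concrete: if the hypothetical zfs backup command wrote one Full_Filename SizeBytes Modification_Date line per object on stderr, vendor software could build its index with something as simple as awk. Everything here — the field layout and the sample paths — is the proposal from this post, not an existing interface; the index lines are simulated:

```shell
# Simulated index lines in the proposed stderr format:
#   full_filename size_bytes mtime_seconds_since_1970
cat > /tmp/zfs_backup_index.$$ <<'EOF'
/export/home/alice/report.doc 524288 1267920000
/export/home/alice/photos/a.jpg 2097152 1267833600
EOF
# A vendor-side indexer: file count, total size, newest mtime per stream.
summary=$(awk '{ total += $2; if ($3 > newest) newest = $3 }
               END { printf "files=%d bytes=%d newest=%d", NR, total, newest }' \
          /tmp/zfs_backup_index.$$)
echo "$summary"
rm -f /tmp/zfs_backup_index.$$
```

The point of the split streams is exactly this: the indexer never has to parse the data stream on stdout, so the on-tape format can change between ZFS versions without breaking vendor code.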
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
pantzer5 wrote:
> These days I am a fan of forward-check access lists, because anyone who owns a DNS server can say that IPAddressX returns aserver.google.com. They cannot set the forward lookup outside of their domain, but they can set up a reverse lookup. The other advantage of forward-looking access lists is that you can use DNS aliases in access lists as well.
That is not true, you have to have a valid A record in the correct domain.

I am not sure what this means, unless it indicates that every application follows the steps outlined below. Unfortunately, only a few applications/services do.

This is how it works (and how you should check your reverse lookups in your applications):
1. Do a reverse lookup.
1b. Check if the name matches any hosts listed in the access list.
2. Do a forward lookup with the name from 1.
3. Check that the IP address is one of the addresses you got in 2.
Ignore the reverse lookup if the check in 3 fails.

The above describes a forward lookup check; it just uses the reverse lookup to determine which forward name to look up. The other method is that when the service starts or re-reads the access list, it finds the A records/IP addresses for all the names in the access list and keeps a record of them, which it uses for checking when a connection comes in. This saves doing a DNS lookup when a new connection starts, but it means all the DNS overhead is at the start. Unfortunately DNS spoofing exists, which means forward lookups can be poisoned. The best (maybe only) way to make NFS secure is NFSv4 and krb5 used together. Cheers
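The three-step check above can be sketched with getent; this is only an illustration of the logic (real access-list code would use the resolver API directly, and the localhost example assumes a normal /etc/hosts):

```shell
# Forward-confirmed reverse lookup: only trust the PTR name if that
# name's own address records include the client's address.
check_client() {
    ip=$1
    # Step 1: reverse lookup, take the first name returned.
    name=`getent hosts "$ip" | awk '{print $2; exit}'`
    [ -z "$name" ] && return 1
    # Steps 2+3: forward-resolve the name and require the IP to match.
    getent ahostsv4 "$name" | awk '{print $1}' | grep -qx "$ip"
}
if check_client 127.0.0.1; then echo "reverse lookup verified"; fi
```

Step 1b (matching the name against the access list itself) would sit between the two lookups; it is omitted here to keep the sketch short.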
Re: [zfs-discuss] Should ZFS write data out when disk are idle
For a RaidZ, when data is written to a disk, are individual 32k writes for the same disk joined together and written out as a single I/O to the disk?

I/Os can be coalesced, but there is no restriction as to what can be coalesced. In other words, subsequent writes can also be coalesced if they are contiguous.

E.g. 128k for file a, 128k for file b, 128k for file c. When written out, does zfs do 32k+32k+32k I/Os to each disk, or will it do one 96k I/O if the space is available sequentially? I should have written this: for a 5-disk RaidZ, is it 5x(32k(a)+32k(b)+32k(c)) I/Os to each disk, or will it attempt to do 5x(96k(a+b+c)) combined larger I/Os to each disk, if all allocated blocks for a, b and c are sequential on some or every physical disk?

I'm not sure how one could write one 96KB physical I/O to three different disks?

I meant to a single disk: three sequential 32k I/Os targeted to the same disk become a single 96k I/O (raidz, or even if it was mirrored). -- richard

Given you have said ZFS will coalesce contiguous writes together (targeted to an individual disk?), what is the largest physical write ZFS will do to an individual disk?
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
In /etc/hosts the format is:
IP FQDN Alias...
which would mean:
1.1.1.1 aserver.google.com aserver aserver-le0
I have seen a lot of sysadmins do the following:
1.1.1.1 aserver aserver.google.com
which means the hosts file (or NIS) does not match DNS. As the first entry is the FQDN, it is the name returned when an application looks up an IP address. In the first example 1.1.1.1 belongs to aserver.google.com (FQDN), and access lists need to match this (e.g. .rhosts/NFS shares). E.g. dig -x 1.1.1.1 | egrep PTR will return the FQDN, for example aserver.google.com (assuming a standard DNS setup). These days I am a fan of forward-check access lists, because anyone who owns a DNS server can say that IPAddressX returns aserver.google.com. They cannot set the forward lookup outside of their domain, but they can set up a reverse lookup. The other advantage of forward-looking access lists is that you can use DNS aliases in access lists as well. E.g. an NFS share should do a DNS lookup on aserver.google.com, get one or more IP addresses, and then check whether the client has one of those IP addresses, rather than doing a string match. PS I read in the doco that as of Solaris 10 the hostname should be set to the FQDN if you wish to use Kerberos, e.g. the hostname command should return aserver.google.com, not aserver, if you wish to use krb5 on Sol10.
Re: [zfs-discuss] Should ZFS write data out when disk are idle
I am talking about having a write queue which points to ready-to-write full stripes. Ready-to-write full stripes would be:
* The last byte of the full stripe has been updated.
* The file has been closed for writing. (Exception to the above rule.)
I believe there is now a scheduler for ZFS to handle read and write conflicts. For example, on a large multi-gigabyte NVRAM array, the only big considerations are how big the Fibre Channel pipe is and the limit on outstanding I/Os. But on SATA off the motherboard, how much RAM cache each disk has is a consideration, as well as the speed of the SATA connection and the number of outstanding I/Os. When it comes time to do a txg, some of the record blocks (most of the full 128k ones) will have been written out already. If we have only written out full record blocks, then there has been no performance loss. Eventually a txg is going to happen, and eventually these full writes will need to happen; if we can choose a less busy time for them, all the better. E.g. on a raidz with 5 disks, if I have 4x128k worth of data to write, let's write it; on a mirror, if I have 128k worth to write, let's write it (record size 128k). Or let it be a tunable on the zpool, as some arrays (RAID5) like to receive larger chunks of data. Why wait for the txg if the disks are not being pressured for reads? Rather than a pause every 30 seconds. Bob wrote (I may not have explained it well enough):
> It is not true that there is no cost though. Since ZFS uses COW, this approach requires that new blocks be allocated and written at a much higher rate. There is also an opportunity cost in that if a read comes in while these continuous writes are occurring, the read will be delayed.
At some stage a write needs to happen. **Full** writes have a very small COW cost compared with small writes. As I said above, I am talking about a write of 4x128k on a 5-disk raidz before the write would happen early.
> There are many applications which continually write/overwrite file content, or which update a file at a slow pace. For example, log files are typically updated at a slow rate. Updating a block requires reading it first (if it is not already cached in the ARC), which can be quite expensive. By waiting a bit longer, there is a much better chance that the whole block is overwritten, so zfs can discard the existing block on disk without bothering to re-read it.
Apps which update at a slow pace will not trigger the above early write until they have written at least a record size worth of data; applications which write less than 128k (recordsize) in 30 secs will never trigger the early write on a mirrored disk, or even a raidz setup. What this will catch is the big writers: files greater than 128k (recordsize) on mirrored disks, and files larger than 4x128k on 5-disk RaidZ sets. So commands like dd if=x of=y bs=512k will not cause issues (pauses/delays) when the txg times out. PS I have already set zfs:zfs_write_limit_override, and I would not recommend anyone set it very low to get the above effect. It's just an idea on how to prevent the delay effect; it may not be practical.
Re: [zfs-discuss] Should ZFS write data out when disk are idle
Sorry — a full stripe on a RaidZ is the recordsize. I.e. if the record size is 128k on a RaidZ made up of 5 disks, then the 128k is spread across 4 disks with the calculated parity on the 5th disk, which means the writes are 32k to each disk. For a RaidZ, when data is written to a disk, are individual 32k writes for the same disk joined together and written out as a single I/O to the disk? E.g. 128k for file a, 128k for file b, 128k for file c. When written out, does zfs do 32k+32k+32k I/Os to each disk, or will it do one 96k I/O if the space is available sequentially? Cheers
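The arithmetic above generalizes: for an N-disk raidz with one parity disk, a full record is split across N-1 data disks. A quick sketch of the per-disk write size:

```shell
recordsize=131072          # 128k default record size, in bytes
disks=5                    # raidz vdev width
parity=1                   # raidz1: one parity disk per stripe
data_disks=$((disks - parity))
per_disk=$((recordsize / data_disks))
echo "each disk receives $((per_disk / 1024))k per full-stripe write"
```

With the 5-disk raidz1 from the post this gives 128k/4 = 32k per data disk, and the parity disk gets a 32k parity block as well, matching the numbers above.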
Re: [zfs-discuss] Should ZFS write data out when disk are idle
I think ZFS should look for more opportunities to write to disk, rather than leaving it to the last second (5 seconds) as it appears it does. E.g. if a file has a record size worth of data outstanding, it should be queued within ZFS to be written out. If the record is updated again before a txg, then it can be re-queued (if it has left the queue) and written to the same block or a new block. The write queue would empty when there is spare I/O bandwidth capacity and memory capacity on the disk, determined through outstanding I/Os. Once the data is on disk it could be free to be re-used even before the txg has occurred, but checksum details would need to be recorded first. The txg comes along after X seconds and finds most of the data writes have already happened and only metadata writes are left to do. One would assume this would help with the delays at txg time talked about in this thread. The example below shows 28 x 128k writes to the same file before anything is written to disk, and the disks are idle the entire time. There is no cost to writing to disk if the disk is not doing anything or is under capacity. (Not a perfect example.) At the other end, maybe access time updates should not be written to disk until there is some real data to write, or 30 minutes have passed, to allow green disks to power down for a while (atime=on|off|delay). Cheers

No dedup, but compression on:

while sleep 1 ; do echo `dd if=/dev/random of= bs=128k count=1 2>&1` ; done

iostat -zxcnT d 1

     cpu
 us sy wt id
  0  5  0 94
                    extended device statistics
    r/s    w/s   kr/s   kw/s  wait actv wsvc_t asvc_t  %w  %b device
    0.0   53.0    0.0  301.5   0.0  0.2    0.0    3.4   0   4 c5t0d0
    0.0   53.0    0.0  301.5   0.0  0.2    0.0    3.1   0   4 c5t2d0
    0.0   58.0    0.0  127.0   0.0  0.0    0.0    0.1   0   0 c5t1d0
    0.0   58.0    0.0  127.0   0.0  0.0    0.0    0.1   0   0 c5t3d0
0+1 records in
0+1 records out
Monday, 8 March 2010 02:51:41 PM EST
     cpu
 us sy wt id
  0  4  0 96
                    extended device statistics
    r/s    w/s   kr/s   kw/s  wait actv wsvc_t asvc_t  %w  %b device
    0.0    3.0    0.0    2.0   0.0  0.0    0.0    0.5   0   0 c5t0d0
    0.0    3.0    0.0    2.0   0.0  0.0    0.0    0.5   0   0 c5t2d0
    0.0    1.0    0.0    0.0   0.0  0.0    0.0    0.0   0   0 c5t1d0
    0.0    1.0    0.0    0.0   0.0  0.0    0.0    0.0   0   0 c5t3d0
0+1 records in
0+1 records out
Monday, 8 March 2010 02:51:42 PM EST

[The samples from 02:51:43 to 02:51:53 all look the same: one dd record in/out per second, no device activity at all, cpu around 0-1% us / 3-4% sy / 95-96% id, with a brief 19-27% sy spike at 02:51:48-49.]
Re: [zfs-discuss] Help with corrupted pool
Create a new empty pool on the Solaris system and let it format the disks etc., i.e. use the disk names cXtXd0. This should put the EFI label on the disks and then set up the partitions for you. Just in case, here is an example. Go back to the Linux box and see if you can use tools to see the same partition layout; if you can, then dd it to the correct spot, which in Solaris is c5t2d0s0. (zfs send | zfs recv would be easier.)

-bash-4.0$ pfexec fdisk -R -W - /dev/rdsk/c5t2d0p0

* /dev/rdsk/c5t2d0p0 default fdisk table
* Dimensions:
*    512 bytes/sector
*    126 sectors/track
*    255 tracks/cylinder
*  60800 cylinders
*
* systid:
*    1: DOSOS12
*  238: EFI_PMBR
*  239: EFI_FS
*
* Id  Act  Bhead  Bsect  Bcyl  Ehead  Esect  Ecyl  Rsect  Numsect
 238   0    255    63    1023   255    63    1023      1  1953525167
   0   0      0     0       0     0     0       0      0           0
   0   0      0     0       0     0     0       0      0           0
   0   0      0     0       0     0     0       0      0           0

-bash-4.0$ pfexec prtvtoc /dev/rdsk/c5t2d0

* /dev/rdsk/c5t2d0 partition map
* Dimensions:
*        512 bytes/sector
* 1953525168 sectors
* 1953525101 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*           34       222       255
*
*                          First     Sector        Last
* Partition  Tag  Flags    Sector     Count       Sector  Mount Directory
        0     4    00         256  1953508495  1953508750
        8    11    00  1953508751       16384  1953525134
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
Maybe look at the rsync and librsync (http://librsync.sourceforge.net/) code to see if a ZFS API could be designed to help rsync/librsync in the future, as well as diff. It might be a good idea for POSIX to have a single-checksum and a multi-checksum interface. One problem could be block sizes: if a file is re-written and is the same size, it may have different ZFS record sizes within, if it was written over a long period of time (txg's) (ignoring compression), and therefore you could not use ZFS checksums to compare two files. Side note: it would be nice if, on every txg, ZFS only wrote full record sizes unless it was short on memory or a file was closed. Maybe the txg could happen more often if it just scanned for full-recordsize writes and closed files, or for blocks which had not been altered for three scans.
Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?
I would have thought that if I write 1k, the ZFS txg times out in 30 secs, and the 1k will be written to disk in a 1k record block; then if I write 4k, 30 secs later another txg happens and a 4k record-size block will be written; and then if I write 130k, a 128k and a 2k record block will be written — making the file have record sizes of 1k+4k+128k+2k.
[zfs-discuss] send/received inherited bug?, received overrides parent, snv_130 6920906
Here is the output:

-bash-4.0# uname -a
SunOS 5.11 snv_130 i86pc i386 i86pc
-bash-4.0# zfs get -r -o all compression mainfs01 | egrep -v \@
NAME            PROPERTY     VALUE    RECEIVED  SOURCE
mainfs01        compression  gzip-3   -         local
mainfs01/home   compression  gzip-3   lzjb      local
mainfs01/mysql  compression  gzip-3   -         inherited from mainfs01
-bash-4.0# zfs inherit compression mainfs01/home
-bash-4.0# zfs get -r -o all compression mainfs01 | egrep -v \@
NAME            PROPERTY     VALUE    RECEIVED  SOURCE
mainfs01        compression  gzip-3   -         local
mainfs01/home   compression  lzjb     lzjb      received
mainfs01/mysql  compression  gzip-3   -         inherited from mainfs01
-bash-4.0# zfs inherit -S compression mainfs01/home
-bash-4.0# zfs get -r -o all compression mainfs01 | egrep -v \@
NAME            PROPERTY     VALUE    RECEIVED  SOURCE
mainfs01        compression  gzip-3   -         local
mainfs01/home   compression  lzjb     lzjb      received
mainfs01/mysql  compression  gzip-3   -         inherited from mainfs01
-bash-4.0# zfs inherit compression mainfs01/home
-bash-4.0# zfs get -r -o all compression mainfs01 | egrep -v \@
NAME            PROPERTY     VALUE    RECEIVED  SOURCE
mainfs01        compression  gzip-3   -         local
mainfs01/home   compression  lzjb     lzjb      received
mainfs01/mysql  compression  gzip-3   -         inherited from mainfs01

How do I get this to say:

NAME            PROPERTY     VALUE          RECEIVED  SOURCE
mainfs01        compression  gzip-3         -         local
mainfs01/home   compression  [b]gzip-3[/b]  lzjb      [b]inherited from mainfs01[/b]
mainfs01/mysql  compression  gzip-3         -         inherited from mainfs01

Cheers
Re: [zfs-discuss] zpool fragmentation issues? (dovecot)
In my previous post I was referring more to mdbox (multi-dbox) rather than dbox; however, I believe the metadata is stored with the mail msg in version 1.x, while in 2.x the metadata is not updated within the msg, which would be better for ZFS. What I am saying is that one msg per file, which is never updated, is better for snapshots. I believe the 2.x version of single-dbox should be better for snapshots (i.e. metadata is no longer stored with the msg) compared with 1.x dbox. Cheers
Re: [zfs-discuss] (Practical) limit on the number of snapshots?
One thing which may help: zfs import used to be single-threaded, i.e. it opened every disk (maybe slice) one at a time and processed it. As of 128b it is multi-threaded, i.e. it opens N disks/slices at once and processes N disks/slices at once, where N is the number of threads it decides to use. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191 This most likely caused other parts of the process to become multi-threaded as well. It would be nice to no longer have /etc/zfs/zpool.cache, now that zfs import is fast enough (which is a second reason I logged the bug).
Re: [zfs-discuss] Import a SAN cloned disk
Before Veritas VM had support for this, you needed to use a different server to import a disk group. You could use a different server for ZFS too, which would also take the backup load off the main server. Cheers
Re: [zfs-discuss] Transaction consistency of ZFS
Because ZFS is transactional (it effectively preserves ordering), the rename trick will work. If you find a stale .filename, delete it; create a new .filename, and when you have finished writing, rename it to filename. If filename exists, you know all writes were completed. If you have a batch system which looks for the file, it will not find it until it is renamed. Not that I am a fan of batch systems which use CPU to poll for a file's existence.
Re: [zfs-discuss] Accidentally added disk instead of attaching
What about removing attach/detach and replacing it with: zpool add [-fn] 'pool' submirror 'device/mirrorname' 'new_device' e.g.

    NAME        STATE   READ WRITE CKSUM
    rpool       ONLINE     0     0     0
      mirror-01 ONLINE     0     0     0
        c4d0s0  ONLINE     0     0     0
        c3d0s0  ONLINE     0     0     0

zpool add rpool submirror mirror-01 c5d0s0 # or: zpool add rpool submirror c4d0s0 c5d0s0
zpool remove rpool c5d0s0

Some more examples:
zpool add 'pool' submirror log-01 c7d0s0 # create a mirror for the intent log
And maybe one day:
zpool add 'pool' subraidz raidz2-01 c5d0s0
to add an extra disk to a raidz group and have the data restriped in the background. Which would mean the vdev syntax would support concat (was disk), concat-file (was file), mirror, submirror, raidz, raidzN, subraidz (one day), spare, log, cache. And change zpool add rpool disk c5d0s0 to zpool add rpool concat c5d0s0 # instead of disk use concat, or zpool add rpool concatfile 'path to file' # instead of file
Cheers
Re: [zfs-discuss] Transaction consistency of ZFS
If a power failure happens you will lose anything in cache, so you could lose the entire file on power failure if the system is not busy (i.e. ZFS does delayed writes, unless you do an fsync before closing the file). I would still like to see a file system option "sync on close" or even "wait for txg on close". Some of the best methods are to create a temp file, e.g. .download.filename, and rename it to filename when the download (or whatever) is successful; or to create an extra empty file to say it has been completed, e.g. filename.dn. I prefer the rename trick.
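The rename trick mentioned in both of these posts looks like this in practice; the flush before the rename is what guarantees the data is on disk before the visible name appears (file names here are just examples, and sync(1) stands in for the fsync(fd) a real program would use):

```shell
# Write into a hidden temp name, flush, then atomically rename.
# rename(2) is atomic, so readers either see the complete file or nothing.
tmp=".download.payload.$$"
final="payload"
printf 'file contents\n' > "$tmp"
sync                      # crude stand-in for fsync() from a real program
mv "$tmp" "$final"        # atomic within the same file system
[ -f "$final" ] && echo "complete file visible as $final"
```

A batch job polling for "payload" can never see a half-written file, which is the whole point of the trick.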
Re: [zfs-discuss] Zpools on USB zpool.cache zpool import
The zpool.cache file makes clustering complex. {Assume the man page is still correct.} From the zpool man page:

cachefile=path | none
Controls the location of where the pool configuration is cached. Discovering all pools on system startup requires a cached copy of the configuration data that is stored on the root file system. All pools in this cache are automatically imported when the system boots. Some environments, such as install and clustering, need to cache this information in a different location so that pools are not automatically imported. Setting this property caches the pool configuration in a different location that can later be imported with zpool import -c. ... When the last pool using a cache file is exported or destroyed, the file is removed.

zpool import [-d dir | -c cachefile] [-D]
Lists pools available to import. If the -d option is not specified, this command searches for devices in /dev/dsk.

A truss of zpool import indicates that it is not multi-threaded when scanning for disks, i.e. it scans 1 disk at a time instead of X at a time, so it does take a while to run. It would be nice if this was multi-threaded. If the cache file is to stay, it should do a scan of /dev to fix itself at boot if something is wrong, and report to the console that it is doing a scan, especially if it is not multi-threaded. PS it would be nice to have a zpool diskinfo 'devicepath' which reports whether the device belongs to a zpool, imported or not, and all the details about any zpool it can find on the disk, e.g. its file systems (zdb is only for ZFS engineers, says the man page). 'zpool import' needs an option to list the file systems of a pool which is not yet imported, and their properties, so you can have more information about a pool before importing it. Cheers

Original Message: On Mon, Mar 23, 2009 at 4:45 PM, Mattias Pantzare pantz...@gmail.com wrote: If I put my disks on a different controller zfs won't find them when I boot. That is bad. It is also an extra level of complexity.
Correct me if I'm wrong, but wading through all of your comments, I believe what you would like to see is ZFS automatically rescanning when the cache is invalid rather than requiring manual intervention, no? That would seem to me to be rather sane behavior, and a legitimate request to add as an option. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
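The single-threaded device scan complained about above could in principle be parallelized along these lines. This is a hedged Python sketch, not ZFS code: probe() is a hypothetical stand-in for reading a vdev label from a device, and the LABELS table exists only so the sketch is self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for "read the ZFS vdev label from a device";
# a plain table lookup so the sketch is self-contained and runnable.
LABELS = {"/dev/dsk/c0t0d0s0": "rpool", "/dev/dsk/c0t1d0s0": None}

def probe(dev):
    return LABELS.get(dev)

def scan(devices, workers=8):
    # Probe up to `workers` devices concurrently instead of one at a time,
    # which is the behavior the post asks `zpool import` to adopt.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [label for label in pool.map(probe, devices) if label]

print(scan(sorted(LABELS)))
```

Since pool.map preserves input order, the scan result is deterministic even though the probes run concurrently.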
Re: [zfs-discuss] Zpools on USB zpool.cache
Do we still need zpool.cache at all? I believe early versions of zpool used the cache to remember which pools to import at boot. I understand newer versions of ZFS still use the cache, but also check whether the pool records the correct hostname for the server, and will only import it if the hostname matches.

I suggest that at boot ZFS should do a (multi-threaded) scan of every disk for ZFS labels, and import the ones with the correct hostname and an import flag set, without using the cache file. Maybe keep the cache file just for non-EFI disks/partitions, but without storing pool names in it; and you should be able to tell ZFS to do a full scan that includes partitioned disks. Cheers

Original Message: ZFS maintains a cache of what pools were imported so that at boot time it will automatically try to re-import the pool. The file is /etc/zfs/zpool.cache and you can view its contents by using zdb -C. If the current state of affairs does not match the cache, then you can export the pool, which will clear its entry in the cache. Then retry the import. -- richard
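The boot-time policy proposed above can be sketched as a small decision function. Assumptions are labeled: the per-pool autoimport flag does not exist today (the post proposes it), and the dicts stand in for labels read off disk during a full scan.

```python
import socket

def pools_to_import(discovered, hostname=None):
    """Return the pools a cache-less boot should import: those whose
    on-disk hostname matches this server and whose autoimport flag
    (a proposed property, not an existing one) is set."""
    hostname = hostname or socket.gethostname()
    return [p["name"] for p in discovered
            if p["hostname"] == hostname and p["autoimport"]]

# Stand-ins for labels found by a multi-threaded disk scan:
pools = [
    {"name": "rpool",   "hostname": "serverA", "autoimport": True},
    {"name": "data",    "hostname": "serverB", "autoimport": True},   # foreign host: skip
    {"name": "scratch", "hostname": "serverA", "autoimport": False},  # opted out: skip
]
print(pools_to_import(pools, hostname="serverA"))
```

The opt-out case is what would let clustering software keep a pool out of the automatic import and bring it up by hand, as the earlier post suggests.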
Re: [zfs-discuss] device alias
ZFS should allow a 31-character + NUL comment against each disk. This would work well with the hostname string (which I assume is max 255 + NUL). If a disk fails it should report "c6t4908029d0 failed: <comment from disk>", and it should also remember the comment until reboot.

This would be useful for DR, or in clusters: by giving a disk a comment, the operator can check its existence on a different server, work out which one is missing, and fix it before doing an import. You would also need a command to dump the comment out without importing the disk. In fact it would be nice to have a tool that checks whether a disk is a ZFS disk and prints out its info without needing to import it. Cheers
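The 31-characters-plus-NUL comment proposed above amounts to a fixed 32-byte field in the disk label. A minimal sketch of the packing, with the field size taken from the post (this is an illustration, not an actual ZFS on-disk format):

```python
FIELD_LEN = 32  # 31 usable characters plus a terminating NUL, per the proposal

def pack_comment(text):
    # Truncate to 31 bytes and NUL-pad to a fixed-width field.
    raw = text.encode("ascii", "replace")[:FIELD_LEN - 1]
    return raw + b"\x00" * (FIELD_LEN - len(raw))

def unpack_comment(field):
    # Everything up to the first NUL is the operator's comment.
    return field.split(b"\x00", 1)[0].decode("ascii")

field = pack_comment("rack4-shelf2-slot11")
print(unpack_comment(field))
```

A fixed-width NUL-padded field keeps the label layout simple, at the cost of silently truncating anything past 31 characters.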
Re: [zfs-discuss] Wish List
A close-sync option on file systems (i.e. when an application closes a file it is flushed to disk, including mmap'd pages, so closed files lose no data on a system crash).

Atomic/locked operations across pools, e.g. snapshot all or selected pools at the same time.

Allowance for offline files, e.g. the first and last parts of a file on disk and the rest on tape/CD/DVD/Blu-ray etc.
Re: [zfs-discuss] ZFS with HDS TrueCopy and EMC SRDF
Date: Thu, 26 Jul 2007 20:39:09 PDT From: Anton B. Rang
> That said, I'm not sure exactly what this buys you for disk replication. What's special about files which have been closed? Is the point that applications might close a file and then notify some other process of the file's availability for use?

Yes.

E.g. 1: A program starts an output job and completes it in the OS cache on Server A. Server A tells batch-scheduling software on Server B that the job is complete. Server A crashes; the file no longer exists, or is truncated to whatever escaped the OS cache. Server B schedules the next job on the assumption that the file created on Server A is OK.

E.g. 2: A program starts an output job and completes it in the OS cache on Server A. A DB on Server A, running in a different ZFS pool, updates a record to note that the output is complete (the DB uses O_DSYNC). Server A crashes; the file no longer exists, or is truncated to whatever escaped the OS cache. The DB on Server A now claims the file is complete.

I believe that sync-on-close should be the default. File system integrity should be more than just being able to read a file which has been truncated by a system crash or power failure.

E.g. 3 (a bit cheeky :-): You vi a file, save it, and the system crashes. You look back at the screen and say "thank god, I saved the file in time", because the $ prompt is back. But it all happened in the OS cache; when the system returns, the file does not exist. (I am ignoring vi -r.)

$ vi x
$ <connection lost>

Therefore users should do:

$ vi x
$ sleep 5 ; echo file x now on disk :-)
$ echo add a line >> x
$ sleep 5 ; echo update to x complete

UFS forcedirectio and VxFS closesync ensure that whatever happens, your files will always exist if the program completed. Therefore with (synchronous) disk replication the file exists at the other site at its finished size. Introducing DR with disk replication generally means you cannot afford to lose any saved data.
UFS forcedirectio has a larger performance hit than VxFS closesync. Cheers
[zfs-discuss] ZFS with HDS TrueCopy and EMC SRDF
Guys, what is the best way to ask for a feature enhancement to ZFS?

To make ZFS useful for DR disk replication, we need to be able to set an option, close sync, against the pool or the file system or both: when a program closes a file, any outstanding writes are flushed to disk before close() returns to the program. So when a program ends, you are guaranteed any state information is saved to disk. (exit() also results in close being called.)

open(xxx, O_DSYNC) is only good if you can alter the source code. Shell scripts using awk, head, tail, echo etc. to create output files do not use O_DSYNC; when the shell script returns 0, you want to know that all the data is on disk, so if the system crashes the data is still there.

PS: it would be nice if UFS had closesync as well, instead of using forcedirectio. Cheers
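What a closesync option would impose on every file can be approximated per-file today by fsyncing before close. A minimal Python sketch (durable_write is a hypothetical helper for illustration, not part of any ZFS or OS interface):

```python
import os
import tempfile

def durable_write(path, data):
    """Write data and flush it to stable storage before closing --
    the behavior a closesync option would give every file implicitly."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)   # block until the data has reached the disk
    finally:
        os.close(fd)   # close now returns only after the data is durable

out = os.path.join(tempfile.gettempdir(), "job.out")
durable_write(out, b"job complete\n")
```

The point of the feature request is precisely that you cannot retrofit this into awk, head, tail or echo: the flush has to happen in the file system's close path, not in the application.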