Re: [zfs-discuss] zpool split problem?

2010-04-01 Thread Damon Atkins
> You mean /usr/sbin/sys-unconfig?

No, it does not reset a system back far enough.
You are still left with the original path_to_inst and the device tree.
e.g. if you take a disk to a different system without cleaning up the system
first, the first disk might end up being sd10 and c15t0d0s0 instead of sd0
and c0.
i.e. cleaning up means removing /etc/path_to_inst and most of what is in the device tree.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool split problem?

2010-03-31 Thread Damon Atkins
Why do we still need the /etc/zfs/zpool.cache file?
(I can understand that it was useful when zpool import was slow.)

zpool import is now multi-threaded
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191), hence a
lot faster, and each disk contains the hostname
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6282725), so if a
pool records the same hostname as the server it can be imported.

i.e. this bug should not be a problem any more with a multi-threaded zpool
import: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6737296

HA storage software should be changed to just do a "zpool import -h mypool"
instead of using a private zpool.cache file (-h meaning ignore the check that
the pool was last imported by a different host; and maybe a noautoimport
property is needed on a zpool so clustering software can decide to import it
by hand, as it does now).

And therefore this zpool split problem would be fixed.
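
For what it is worth, a hedged sketch of what clustering software can already
do today with the existing cachefile property, so the pool never lands in the
boot-time cache (pool and device names are made up):

# create the pool with no cache file, so no node auto-imports it at boot
zpool create -o cachefile=none mypool c5t2d0
# on whichever node should own it, import it without caching it either
zpool import -o cachefile=none mypool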
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool split problem?

2010-03-31 Thread Damon Atkins
I assume the swap, dumpadm and grub issues are because the pool has a different
name now, but is it still a problem if you take it to a *different system*,
boot off a CD and change the name back to rpool? (Which is most likely
unsupported, i.e. no help to get it working.)

Over 10 years ago (way before Flash Archive existed) I developed a script, used
after splitting a mirror, which would remove most of the device tree and clean
up path_to_inst etc. so it looked like the OS had just been installed and was
about to do the reboot without the install CD. (Everything was still in there
except for the hardware-specific stuff. I no longer have the script and most
likely would not do it again, because it is not a supported install method.)

I still had to boot from CD on the new system and create the device tree before
booting off the disk for the first time, and then fix vfstab (but the vfstab
fix should no longer be needed with a ZFS rpool).

It would be nice for Oracle/Sun to produce a separate script which resets
system/devices back to an install-like beginning, so you can move an OS disk
with the current password file and software from one system to another and
have it rebuild the device tree on the new system.

From memory (updated for ZFS), something like the following (a rough sketch
follows this list):
zpool split rpool newrpool
mount newrpool
remove from newrpool's /dev and /devices all non-packaged content (i.e.
dynamically created content)
clean up newrpool/etc/path_to_inst
create newrpool's /reconfigure
remove all previous snapshots in newrpool
update the beadm info inside newrpool
ensure grub is installed on the disk
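
A rough, hedged sketch of those steps as a script (unsupported and destructive;
the pool name newrpool, the mount point /mnt and the BE name are assumptions,
and the beadm/grub steps are only indicated as comments):

#!/bin/ksh
zpool split rpool newrpool
zpool import -R /mnt newrpool              # import under an alternate root
zfs mount newrpool/ROOT/mybe               # BE dataset name is an assumption
# strip dynamically created device content
rm -rf /mnt/dev/dsk/* /mnt/dev/rdsk/* /mnt/devices/*
cp /dev/null /mnt/etc/path_to_inst         # crude "clean up"; rebuilt on reconfiguration boot
touch /mnt/reconfigure                     # force a reconfiguration boot
# drop snapshots carried across by the split
zfs list -H -t snapshot -o name -r newrpool | xargs -n 1 zfs destroy
# beadm/menu.lst updates and installgrub would go here (system dependent)
zpool export newrpool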
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs send and ARC

2010-03-25 Thread Damon Atkins
In the "Thoughts on ZFS Pool Backup Strategies" thread it was stated that zfs
send sends uncompressed data and uses the ARC.

If zfs send sends uncompressed data that has already been compressed on disk,
this is not very efficient, and it would be *nice* to see it send the original
compressed data (or have an option to do so).

I thought I would ask a couple of true-or-false questions, mainly for
curiosity's sake.

If zfs send uses the standard ARC (when something is not already in the ARC),
I would expect this to hurt the performance of the system to some degree
(i.e. I assume it has the effect of replacing current/useful data in the cache
with not-very-useful/old data, depending on how large the zfs send is).


If the above is true, zfs send and a "zfs backup" (if such a command existed to
back up and restore a file or set of files with all ZFS attributes) would
improve the performance of normal reads/writes by avoiding the ARC (or, if
easier to implement, by having their own private ARC).

Or does it use the same sort of code path as setting "primarycache=none" on a
file system?

Has anyone monitored ARC hit rates while doing a large zfs send?
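
For anyone curious, a minimal sketch of watching the raw ARC hit/miss counters
with kstat while a send runs (the 10-second interval is arbitrary):

while true ; do
    date
    kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
    sleep 10
done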

Cheers
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS file system confusion

2010-03-25 Thread Damon Atkins
NFSv4 has the concept of a root of the overall exported filesystem (the
pseudo-filesystem).

In Linux terms this is file handle 0, i.e. setting fsid=0 when exporting.

This would explain why someone said that Linux (NFSv4) automounts an exported
filesystem under another exported filesystem,

i.e. you can mount servername:/ and browse all the exported/shared file systems
you have access to.
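
A hedged example of what that looks like on a Linux NFSv4 server (the paths are
made up):

# /etc/exports on the Linux server: /export is the NFSv4 pseudo-root (fsid=0)
/export        *(ro,fsid=0,crossmnt)
/export/home   *(rw)

# on a client, mounting the pseudo-root lets you browse everything shared below it
mount -t nfs4 servername:/ /mnt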

I don't think this made it into the Solaris NFSv4 server.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-24 Thread Damon Atkins
You could try copying the file to /tmp (i.e. swap/RAM) and doing a continuous
loop of checksums, e.g.

while [ ! -f libdlpi.so.1.x ] ; do sleep 1 ; cp libdlpi.so.1 libdlpi.so.1.x
    A=`sha512sum -b libdlpi.so.1.x | awk '{print $1}'`
    # a matching checksum discards the copy; a mismatch leaves it and ends the loop
    [ "$A" = "<known-good sha512 of libdlpi.so.1>" ] && rm libdlpi.so.1.x
done ; date

Assuming the file never goes to swap, this would tell you if something on the
motherboard is playing up.

I have seen a CPU randomly set a byte to 0 that should not have been 0; I think
it was an L1 or L2 cache problem.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] CR 6880994 and pkg fix

2010-03-24 Thread Damon Atkins
You could also use psradm to take a CPU off-line.
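
For example (the processor ID is just a placeholder):

psrinfo          # list processor IDs and their current state
psradm -f 1      # take processor 1 off-line
psradm -n 1      # bring it back on-line later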

At boot I would *assume* the system boots the same way every time unless
something changes, so you could be hitting the same CPU core every time, or the
same bit of RAM, until the system is fully booted.

Or even run SunVTS (the Sun Validation Test Suite), which I believe has a test
similar to the cp loop in /tmp, along with all the other tests it provides.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-18 Thread Damon Atkins
Consider a system with 100 TB of data that is 80% full, and a user asks their
local system admin to restore a directory of large files as it was 30 days ago,
with all Windows/CIFS ACLs, NFSv4 ACLs, etc.

If we used zfs send, we would need to go back to a zfs send stream from some
30 days ago and find 80 TB of disk space to be able to restore it.

zfs send/recv is great for copying one ZFS file system to another file system,
even across servers.

But there needs to be a tool that:
* restores an individual file or a zvol (with all ACLs/properties);
* allows backup vendors (which place backups on tape, disk, CD, etc.) to build
indexes of what is contained in the backup (e.g. filename, owner, size,
modification dates, type (dir/file/etc.));
* streams output suitable for devices like tape drives;
* can tell if a file is corrupted when it is being restored;
* may support recovery of corrupt data blocks within the stream;
* is preferably gnutar command-line compatible;
* admins can use to back up and transfer a subset of files, e.g. a user's home
directory (which is not a file system), to another server or onto CD to be sent
to their new office location, and so on.

For backup vendors, is the idea that they use the NDMP protocol to back up ZFS
and all its properties/ACLs? Or is a new tool required to achieve the above?

Cheers
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-17 Thread Damon Atkins
I vote for ZFS needing backup and restore commands that work against a
snapshot.

The backup command should output on stderr at least
"Full_Filename SizeBytes Modification_Date_1970secSigned"
so backup software can build indexes, while stdout contains the data.
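
A hedged sketch of how such a (non-existent) command might be driven, purely to
illustrate the stdout/stderr split; the command name, snapshot and tape device
are made up:

# hypothetical -- "zfs backup" does not exist today
zfs backup tank/home@20100317 2> /var/tmp/index.txt | dd of=/dev/rmt/0n bs=256k
# index.txt would then hold the "Full_Filename SizeBytes Modification_Date" lines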

The advantage of ZFS providing the command is that as ZFS is upgraded or new
features are added, backup vendors do not need to re-test their code. It could
also mean that when encryption comes along, a property on the pool could
indicate whether it is OK to decrypt only the filenames as part of a backup.

Restore would work the same way, except that you would pass a filename or a
directory to restore, etc., and the backup software would send the stream back
to the zfs restore command.

The other alternative is for ZFS to provide a standard API for backups, like
Oracle does with RMAN.

It would be very useful with snapshots across pools
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6916404
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect

2010-03-11 Thread Damon Atkins
pantzer5 wrote:

> > These days I am a fan of forward-check access lists, because anyone who
> > owns a DNS server can say that IPAddressX returns aserver.google.com.
> > They can not set the forward lookup outside of their domain, but they can
> > set up a reverse lookup. The other advantage of forward-looking access
> > lists is that you can use DNS aliases in access lists as well.
>
> That is not true, you have to have a valid A record in the correct domain.

I am not sure what this means, unless it indicates every application follows
the steps outlined below. Unfortunately, only a few applications/services do.

> This is how it works (and how you should check your reverse lookups in
> your applications):
>
> 1. Do a reverse lookup.
1b. Check if the name matches any hosts listed in the access list.
> 2. Do a lookup with the name from 1.
> 3. Check that the IP address is one of the addresses you got in 2.
>
> Ignore the reverse lookup if the check in 3 fails.
The above describes a forward lookup check; it just uses the reverse lookup to
determine which forward name to look up.
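
A minimal sketch of that check in shell, assuming dig is available (the IP
address is made up):

ip=1.1.1.1
name=`dig +short -x $ip | sed 's/\.$//'`     # 1. reverse lookup
ok=no
for a in `dig +short A "$name"` ; do         # 2. forward lookup of the returned name
    [ "$a" = "$ip" ] && ok=yes               # 3. the IP must be among the answers
done
echo "reverse name $name for $ip verified: $ok"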

The other method is that when the service starts, or re-reads the access list,
it finds the A record/IP address for every name in the access list and keeps a
record of them, which it uses for checking when a connection comes in. This
saves doing a DNS lookup when a new connection starts, but it means all the DNS
overhead is at the start.

Unfortunately DNS spoofing exists, which means forward lookups can be poisoned.

The best (maybe only) way to make NFS secure is NFSv4 and Kerb5 used together.
Cheers
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Should ZFS write data out when disk are idle

2010-03-10 Thread Damon Atkins
 
> > For a RaidZ, when data is written to a disk, are individual 32k writes
> > joined together for the same disk and written out as a single I/O to the disk?
>
> I/Os can be coalesced, but there is no restriction as to what can be coalesced.
> In other words, subsequent writes can also be coalesced if they are contiguous.
>
> > e.g. 128k for file a, 128k for file b, 128k for file c.  When written out
> > does zfs do 32k+32k+32k i/o to each disk, or will it do one 96k i/o if the
> > space is available sequentially?

I should have written this: for a 5-disk RaidZ, will it do
5 x (32k(a)+32k(b)+32k(c)) i/os to each disk, or will it attempt to do
5 x (96k(a+b+c)) combined larger I/Os to each disk, if all the allocated blocks
for a, b and c are sequential on some or every physical disk?

> I'm not sure how one could write one 96KB physical I/O to three different disks?

I meant to a single disk: three sequential 32k i/os targeted to the same disk
become a single 96k i/o (raidz, or even if it was mirrored).

> -- richard

Given you have said ZFS will coalesce contiguous writes together (targeted to
an individual disk?):
What is the largest physical write ZFS will do to an individual disk?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect

2010-03-10 Thread Damon Atkins
In /etc/hosts the format is:
IP FQDN Alias...
which would mean: 1.1.1.1 aserver.google.com aserver aserver-le0
I have seen a lot of sysadmins do the following:
1.1.1.1 aserver aserver.google.com
which means the hosts file (or NIS) does not match DNS.

As the first entry is the FQDN, it is the name returned when an application
looks up an IP address. In the first example 1.1.1.1 belongs to
aserver.google.com (FQDN), and access lists need to match this (e.g. .rhosts/NFS shares).

e.g. dig -x 1.1.1.1 | egrep PTR
will return the FQDN, for example aserver.google.com (assuming a standard DNS
setup).

These days I am a fan of forward-check access lists, because anyone who owns
a DNS server can say that IPAddressX returns aserver.google.com. They can not
set the forward lookup outside of their domain, but they can set up a reverse
lookup. The other advantage of forward-looking access lists is that you can
use DNS aliases in access lists as well.

e.g. an NFS share should do a DNS lookup on aserver.google.com, get one or more
IP addresses, and then check whether the client has one of those IP addresses,
rather than doing a string match.

PS: I read in the documentation that as of Solaris 10 the hostname should be
set to the FQDN if you wish to use Kerb5, e.g. the hostname command should
return aserver.google.com.au, not aserver, if you wish to use Kerb5 on Sol10.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Should ZFS write data out when disk are idle

2010-03-09 Thread Damon Atkins
I am talking about having a write queue which points to ready-to-write full
stripes.

Ready-to-write full stripes would be ones where:
* the last byte of the full stripe has been updated; or
* the file has been closed for writing (an exception to the above rule).

I believe there is now an I/O scheduler in ZFS to handle read and write
conflicts.

For example, on a large multi-gigabyte NVRAM array the only big considerations
are how big the Fibre Channel pipe is and the limit on outstanding I/Os.

But with SATA off the motherboard, how much RAM cache each disk has is a
consideration, as well as the speed of the SATA connection and the number of
outstanding I/Os.

When it comes time to do the txg, some of the record blocks (most of the full
128k ones) will have been written out already. If we have only written out full
record blocks, then there has been no performance loss.

Eventually a txg is going to happen and these full writes will need to happen
anyway, but if we can choose a less busy time for them, all the better.

e.g. on a raidz with 5 disks, if I have 4 x 128k worth of data to write, let's
write it; on a mirror, if I have 128k worth to write, let's write it (record
size 128k). Or let it be a tunable per zpool, as some arrays (RAID5) like to
receive larger chunks of data.

Why wait for the txg if the disks are not being pressured for reads, rather
than having a pause every 30 seconds?

Bob wrote (I may not have explained it well enough):
> It is not true that there is no cost though. Since ZFS uses COW,
> this approach requires that new blocks be allocated and written at a
> much higher rate. There is also an opportunity cost in that if a
> read comes in while these continuous writes are occurring, the read
> will be delayed.

At some stage a write needs to happen. **Full** writes have a very small COW
cost compared with small writes. As I said above, I am talking about a write of
4 x 128k on a 5-disk raidz before the write would happen early.

> There are many applications which continually write/overwrite file
> content, or which update a file at a slow pace. For example, log
> files are typically updated at a slow rate. Updating a block requires
> reading it first (if it is not already cached in the ARC), which can
> be quite expensive. By waiting a bit longer, there is a much better
> chance that the whole block is overwritten, so zfs can discard the
> existing block on disk without bothering to re-read it.

Apps which update at a slow pace will not trigger the above early write until
they have written at least a record size worth of data; applications which
write more slowly than 128k (recordsize) per 30 seconds will never trigger the
early write on a mirrored disk or even a raidz setup.

What this will catch is the big writer of files greater than 128k (recordsize)
on mirrored disks, and of files larger than 4 x 128k on 5-disk RaidZ sets.

So commands like dd if=x of=y bs=512k will not cause issues (pauses/delays)
when the txg times out.

PS: I have already set zfs:zfs_write_limit_override, and I would not recommend
anyone set it very low to get the above effect.
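
For reference, a hedged example of how that tunable is usually set; the value
here is arbitrary and, as above, not a recommendation:

# persistent, via /etc/system (takes effect after a reboot)
set zfs:zfs_write_limit_override = 0x20000000

# or live, on a running kernel
echo "zfs_write_limit_override/Z 0x20000000" | mdb -kw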

It's just an idea on how to prevent the delay effect; it may not be practical.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Should ZFS write data out when disk are idle

2010-03-09 Thread Damon Atkins
Sorry, a full stripe on a RaidZ is the recordsize, i.e. if the record size is
128k on a RaidZ made up of 5 disks, then the 128k is spread across 4 disks with
the calculated parity on the 5th disk, which means the writes are 32k to each
disk.

For a RaidZ, when data is written to a disk, are the individual 32k writes for
the same disk joined together and written out as a single I/O to that disk?
e.g. 128k for file a, 128k for file b, 128k for file c: when written out, does
ZFS do 32k+32k+32k i/os to each disk, or will it do one 96k i/o if the space is
available sequentially?

Cheers
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Should ZFS write data out when disk are idle

2010-03-07 Thread Damon Atkins
I think ZFS should look for more opportunities to write to disk rather than
leaving it to the last second (5 seconds), as it appears to do. E.g.:

If a file has a record size worth of data outstanding, it should be queued
within ZFS to be written out. If the record is updated again before a txg, it
can be re-queued (if it has left the queue) and written to the same block or a
new block. The write queue would empty when there is spare I/O bandwidth and
memory capacity on the disk, determined through outstanding I/Os. Once the data
is on disk it could be freed for re-use even before the txg has occurred, but
the checksum details would need to be recorded first. The txg then comes along
after X seconds and finds that most of the data writes have already happened
and only the metadata writes are left to do.

One would assume this would help with the delays at txg time talked about in
this thread.

The example below shows 28 x 128k writes to the same file before anything is
written to disk, and the disks are idle the entire time. There is no cost to
writing to disk if the disk is not doing anything or is under capacity. (Not a
perfect example.)


At the other end, maybe access-time updates should not be written to disk
until there is some real data to write, or 30 minutes have passed, to allow
green disks to power down for a while (atime=on|off|delay).

Cheers

No dedup, but compression on:
while sleep 1 ; do echo `dd if=/dev/random of=<afile> bs=128k count=1 2>&1` ; done

iostat -zxcnT d 1
  us sy wt id
  0  5  0 94
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   53.0    0.0  301.5  0.0  0.2    0.0    3.4   0   4 c5t0d0
    0.0   53.0    0.0  301.5  0.0  0.2    0.0    3.1   0   4 c5t2d0
    0.0   58.0    0.0  127.0  0.0  0.0    0.0    0.1   0   0 c5t1d0
    0.0   58.0    0.0  127.0  0.0  0.0    0.0    0.1   0   0 c5t3d0
0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:41 PM EST
 cpu
 us sy wt id
  0  4  0 96
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    3.0    0.0    2.0  0.0  0.0    0.0    0.5   0   0 c5t0d0
    0.0    3.0    0.0    2.0  0.0  0.0    0.0    0.5   0   0 c5t2d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t1d0
    0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t3d0
0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:42 PM EST
 cpu
 us sy wt id
  1  3  0 96
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:43 PM EST
 cpu
 us sy wt id
  0  4  0 96
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:44 PM EST
 cpu
 us sy wt id
  0  4  0 96
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:45 PM EST
 cpu
 us sy wt id
  0  4  0 96
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:46 PM EST
 cpu
 us sy wt id
  1  4  0 95
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:47 PM EST
 cpu
 us sy wt id
  0  4  0 96
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:48 PM EST
 cpu
 us sy wt id
  0 19  0 80
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device

0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:49 PM EST
 cpu
 us sy wt id
  1 27  0 72
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device

0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:50 PM EST
 cpu
 us sy wt id
  0  3  0 96
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:51 PM EST
 cpu
 us sy wt id
  1  3  0 96
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:52 PM EST
 cpu
 us sy wt id
  0  4  0 95
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0+1 records in 0+1 records out
Monday,  8 March 2010 02:51:53 PM EST
 cpu
 us sy wt id
  0  4  0 96
extended device statistics
 

Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Damon Atkins
Create a new empty pool on the Solaris system and let it format the disks
etc., i.e. use the disk names cXtXd0. This should put the EFI label on the
disks and then set up the partitions for you. Just in case, here is an example.

Go back to the Linux box and see if you can use tools to see the same partition
layout; if you can, then dd it to the correct spot, which in Solaris is
c5t2d0s0. (zfs send | zfs recv would be easier; a rough sketch follows.)
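
A hedged sketch of the send/recv route, assuming both ends run a ZFS recent
enough for send -R / recv -F and that the pool and host names are as shown:

# on the Linux box: snapshot everything and stream it to the Solaris box
zfs snapshot -r oldpool@migrate
zfs send -R oldpool@migrate | ssh solarisbox zfs recv -Fdu newpool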

-bash-4.0$ pfexec fdisk -R -W - /dev/rdsk/c5t2d0p0

* /dev/rdsk/c5t2d0p0 default fdisk table
* Dimensions:
*512 bytes/sector
*126 sectors/track
*255 tracks/cylinder
*   60800 cylinders
*
* systid:
*1: DOSOS12
*  238: EFI_PMBR
*  239: EFI_FS
*
* Id    Act  Bhead  Bsect  Bcyl   Ehead  Esect  Ecyl   Rsect       Numsect
  238   0    255    63     1023   255    63     1023   1           1953525167
  0     0    0      0      0      0      0      0      0           0
  0     0    0      0      0      0      0      0      0           0
  0     0    0      0      0      0      0      0      0           0


-bash-4.0$ pfexec prtvtoc /dev/rdsk/c5t2d0
* /dev/rdsk/c5t2d0 partition map
*
* Dimensions:
* 512 bytes/sector
* 1953525168 sectors
* 1953525101 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*   First SectorLast
*   Sector CountSector
*  34   222   255
*
*                          First      Sector     Last
* Partition  Tag  Flags    Sector      Count      Sector  Mount Directory
       0      4    00          256  1953508495  1953508750
       8     11    00   1953508751       16384  1953525134
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?

2010-02-08 Thread Damon Atkins
Maybe look at the rsync and librsync (http://librsync.sourceforge.net/) code
to see if a ZFS API could be designed to help rsync/librsync in the future, as
well as diff.

It might be a good idea for POSIX to have a single checksum and a 
multi-checksum interface.

One problem could be block sizes: if a file is re-written and is the same
size, it may have different ZFS record sizes within it if it was written over a
long period of time (txgs), ignoring compression, and therefore you could not
use ZFS checksums to compare two files.

Side note:
It would be nice if ZFS on every txg only wrote full record sizes, unless it
was short on memory or a file was closed. Maybe the txg could happen more often
if it just scanned for full-recordsize writes and closed files, or for blocks
which had not been altered for three scans.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?

2010-02-08 Thread Damon Atkins
I would have thought that if I write 1k and the ZFS txg times out after 30
secs, then the 1k will be written to disk in a 1k record block; then if I write
4k and 30 secs later a txg happens, another 4k record-size block will be
written; and then if I write 130k, a 128k and a 2k record block will be
written, making the file have record sizes of
1k+4k+128k+2k.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] send/received inherited bug?, received overrides parent, snv_130 6920906

2010-01-29 Thread Damon Atkins
Here is the output
-bash-4.0# uname -a
SunOS 5.11 snv_130 i86pc i386 i86pc

-bash-4.0# zfs get -r -o all compression  mainfs01 | egrep -v \@
NAME            PROPERTY     VALUE   RECEIVED  SOURCE
mainfs01        compression  gzip-3  -         local
mainfs01/home   compression  gzip-3  lzjb      local
mainfs01/mysql  compression  gzip-3  -         inherited from mainfs01

-bash-4.0# zfs inherit compression mainfs01/home
-bash-4.0# zfs get -r -o all compression  mainfs01 | egrep -v \@
NAME            PROPERTY     VALUE   RECEIVED  SOURCE
mainfs01        compression  gzip-3  -         local
mainfs01/home   compression  lzjb    lzjb      received
mainfs01/mysql  compression  gzip-3  -         inherited from mainfs01

-bash-4.0# zfs inherit -S compression mainfs01/home
-bash-4.0# zfs get -r -o all compression  mainfs01 | egrep -v \@
NAME            PROPERTY     VALUE   RECEIVED  SOURCE
mainfs01        compression  gzip-3  -         local
mainfs01/home   compression  lzjb    lzjb      received
mainfs01/mysql  compression  gzip-3  -         inherited from mainfs01

-bash-4.0# zfs inherit compression mainfs01/home
-bash-4.0# zfs get -r -o all compression  mainfs01 | egrep -v \@
NAME            PROPERTY     VALUE   RECEIVED  SOURCE
mainfs01        compression  gzip-3  -         local
mainfs01/home   compression  lzjb    lzjb      received
mainfs01/mysql  compression  gzip-3  -         inherited from mainfs01

How do I get this to say:
NAME            PROPERTY     VALUE   RECEIVED  SOURCE
mainfs01        compression  gzip-3  -         local
mainfs01/home   compression  gzip-3  lzjb      inherited from mainfs01
mainfs01/mysql  compression  gzip-3  -         inherited from mainfs01

Cheers
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool fragmentation issues? (dovecot)

2010-01-16 Thread Damon Atkins
In my previous post I was referring more to mdbox (multi-dbox) rather than
dbox; however, I believe the metadata is stored with the mail msg in version
1.x, while in 2.x the metadata is not updated within the msg, which would be
better for ZFS.

What I am saying is that one msg per file, which is not updated, is better for
snapshots. I believe the 2.x version of single-dbox (i.e. metadata is no longer
stored with the msg) should be better for snapshots compared with 1.x dbox.

Cheers
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Practical) limit on the number of snapshots?

2010-01-11 Thread Damon Atkins
One thing which may help: zpool import used to be single-threaded, i.e. it
opened and processed one disk (maybe slice) at a time. As of build 128 it is
multi-threaded, i.e. it opens and processes N disks/slices at once, where N is
the number of threads it decides to use.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191

This most likely/maybe causes other parts of the process to now be
multi-threaded as well.

It would be nice to no longer have /etc/zfs/zpool.cache now that zpool import
is fast enough (which is a second reason I logged the bug).
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Import a SAN cloned disk

2009-12-16 Thread Damon Atkins
Before Veritas VM had support for this, you needed to use a different server
to import a disk group. You could use a different server for ZFS too, which
would also take the backup load off the server.

Cheers
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Transaction consistency of ZFS

2009-12-07 Thread Damon Atkins
Because ZFS is transactional (it effectively preserves order), the rename
trick will work. If you find a leftover ".filename", delete it; create a new
".filename" and, when you have finished writing, rename it to "filename". If
"filename" exists you know all writes were completed. If you have a batch
system which looks for the file, it will not find it until it is renamed. (Not
that I am a fan of batch systems which use CPU to poll for a file's existence.)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Accidentally added disk instead of attaching

2009-12-07 Thread Damon Atkins
What about removing attach/detach and replacing them with:
zpool add [-fn] 'pool' submirror 'device/mirrorname' 'new_device'
e.g.
  NAME        STATE     READ WRITE CKSUM
  rpool       ONLINE       0     0     0
    mirror-01 ONLINE       0     0     0
      c4d0s0  ONLINE       0     0     0
      c3d0s0  ONLINE       0     0     0
zpool add rpool submirror mirror-01 c5d0s0 # or
zpool add rpool submirror c4d0s0 c5d0s0
zpool remove rpool c5d0s0
Some more examples
zpool add 'pool' submirror log-01 c7d0s0  # create a mirror for the Intent Log 
And maybe one day: zpool add 'pool' subraidz raidz2-01 c5d0s0, to add an extra
disk to a raidz group and have the disks restriped in the background.

Which would mean that, in terms of syntax, vdevs would support:
concat (was disk), concat-file (was file), mirror, submirror, raidz, raidzN,
subraidz (one day), spare, log, cache
--
And change 
zpool add rpool disk c5d0s0
to
zpool add rpool concat c5d0s0              # instead of disk, use concat; or
zpool add rpool concatfile <path to file>  # instead of file

Cheers
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Transaction consistency of ZFS

2009-12-05 Thread Damon Atkins
If a power failure happens you will lose anything in the cache. So you could
lose the entire file on power failure if the system is not busy (i.e. ZFS does
delay writes, unless you do an fsync before closing the file). I would still
like to see a file system option "sync on close" or even "wait for txg on
close".

Some of the best methods are to create a temp file, e.g. ".download.filename",
and rename it to "filename" when the download (or whatever) is successful; or
to create an extra empty file to say it has been completed, e.g. "filename.dn".
I prefer the rename trick.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpools on USB zpool.cache zpool import

2009-03-24 Thread Damon Atkins
The zpool.cache file makes clustering complex. (Assuming the man page is
still correct.)


From the zpool man page:

cachefile=path | none

Controls the location of where the pool configuration is cached. 
Discovering all pools on system startup requires a cached copy of the 
configuration data that  is  stored on  the  root  file  system. All 
pools in this cache are automatically  imported  when  the  system  boots. 

 Some  environments,  such  as  install and clustering, need to 
cache this information in a different location so that pools are not 
automatically imported.


Setting this property caches the pool configuration in a different 
location  that can later be imported with zpool import -c.
... When the last pool using  a cache file  is  exported  or  
destroyed,  the  file  is removed.


zpool import [-d dir | -c cachefile] [-D]

Lists pools available to import. If the -d option is not
specified,   this   command   searches  for  devices  in
/dev/dsk.
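
To make the quoted behaviour concrete, a hedged example of the alternate
cache-file usage (the path and pool name are made up):

# cache this pool's configuration somewhere other than /etc/zfs/zpool.cache
zpool set cachefile=/var/cluster/zpool.cache tank
# later, import using that cache instead of scanning /dev/dsk
zpool import -c /var/cluster/zpool.cache tank
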
--
A truss of zpool import indicates that it is not multi-threaded when scanning
for disks, i.e. it scans one disk at a time instead of X at a time, so it does
take a while to run. It would be nice if this was multi-threaded.


If the cache file is to stay, it should do a scan of /dev to fix itself at
boot if something is wrong, and report to the console that it is doing a scan,
especially if it is not multi-threaded.


PS: it would be nice to have a "zpool diskinfo <devicepath>" that reports
whether the device belongs to an imported zpool or not, and all the details
about any zpool it can find on the disk, e.g. its file systems (the man page
says zdb is only for ZFS engineers). 'zpool import' needs an option to list the
file systems of a pool which is not yet imported, and its properties, so you
can have more information about it before importing it.


Cheers
 Original Message 



On Mon, Mar 23, 2009 at 4:45 PM, Mattias Pantzare <pantz...@gmail.com> wrote:




If I put my disks on a diffrent controler zfs won't find them when I
boot. That is bad. It is also an extra level of complexity.


Correct me if I'm wrong, but wading through all of your comments, I 
believe what you would like to see is zfs automatically scan if the 
cache is invalid vs. requiring manual intervention, no? 

It would seem to me this would be rather sane behavior and a 
legitimate request to add this as an option.


--Tim



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpools on USB zpool.cache

2009-03-23 Thread Damon Atkins
Do we still need the zpool.cache? I believe early versions of zpool used the
cache to remember which zpools to import at boot. I understand newer versions
of ZFS still use the cache but also check whether the pool contains the correct
hostname of the server, and will only import it if the hostname matches.


I suggest that at boot ZFS should scan every disk (multi-threaded) for ZFS
labels, and import the ones with the correct hostname and with an import flag
set, without using the cache file. Maybe just use the cache file for non-EFI
disks/partitions, but without storing the pool name; and you should be able to
tell ZFS to do a full scan which includes partitioned disks.


Cheers

 Original Message 

ZFS maintains a cache of what pools were imported so that at boot time,
it will automatically try to re-import the pool.  The file is 
/etc/zfs/zpool.cache

and you can view its contents by using zdb -C

If the current state of affairs does not match the cache, then you can
export the pool, which will clear its entry in the cache.  Then retry the
import.
 -- richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] device alias

2007-09-26 Thread Damon Atkins

ZFS should allow 31+NUL chars for a comment against each disk. This would work
well with the hostname string (which I assume is 255+NUL). If a disk fails it
should report "c6t4908029d0 failed: <comment from disk>", and it should also
remember the comment until reboot.

This would be useful for DR, or in clusters. By giving a disk a comment, the
operator can check its existence on a different server, work out which one is
missing and fix it before doing an import. You would also need a command to
dump the comment out without importing the disk. In fact, it would be nice to
have a tool that checks whether a disk is a ZFS disk and prints out its info
without needing to import it.

Cheers
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wish List

2007-08-17 Thread Damon Atkins

* A close-sync option on file systems (i.e. when the app calls close, the file
is flushed, including mmap; no data loss of closed files on a system crash).
* Atomic/locked operations across all pools, e.g. snapshot all or selected
pools at the same time.
* Allowance for offline files, e.g. the first part of a file can be on disk,
the last part can be on disk, and the rest on tape/CD/DVD/Blu-ray etc.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with HDS TrueCopy and EMC SRDF

2007-08-03 Thread Damon Atkins
Date: Thu, 26 Jul 2007 20:39:09 PDT
From: Anton B. Rang

> That said, I'm not sure exactly what this buys you for disk replication.
> What's special about files which have been closed? Is the point that
> applications might close a file and then notify some other process of the
> file's availability for use?

Yes


E.g. 1
A program starts an output job and completes the job in the OS cache on Server
A. Server A tells the batch scheduling software on Server B that the job is
complete. Server A crashes; the file no longer exists or is truncated due to
what was left in the OS cache. Server B schedules the next job on the
assumption that the file created on Server A is OK.

E.g. 2
A program starts an output job and completes the job in the OS cache on Server
A. A DB on Server A, running in a different ZFS pool, updates a DB record to
record the fact that the output is complete (the DB uses O_DSYNC). Server A
crashes; the file no longer exists or is truncated due to what was left in the
OS cache. The DB on Server A contains information saying that the file is
complete.

I believe that sync-on-close should be the default. File system integrity
should be more than just being able to read a file which has been truncated due
to a system crash/power failure etc.

E.g. 3 (a bit cheeky :-)
$ vi <a file>: save the file, the system crashes, you look back at the screen
and say "thank god, I saved the file in time", because on your screen is the
prompt $ again. This is all happening in the OS file cache. When the system
returns, the file does not exist. (I am ignoring vi -r.)
$ vi x
$ <connection lost>
Therefore users should do:
$ vi x
$ sleep 5 ; echo "file x now on disk :-)"
$ echo "add a line" >> x
$ sleep 5 ; echo "update to x complete"

UFS forcedirectio and VxFS closesync ensure that, whatever happens, your files
will always exist if the program completes. Therefore with disk replication
(sync) the file exists at the other site at its finished size. When you
introduce DR with disk replication, it generally means you cannot afford to
lose any saved data. UFS forcedirectio has a larger performance hit than VxFS
closesync.

Cheers
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS with HDS TrueCopy and EMC SRDF

2007-07-26 Thread Damon Atkins
Guys,
   What is the best way to ask for a feature enhancement to ZFS?

To allow ZFS to be useful for DR disk replication, we need to be able to set an
option against the pool or the file system or both, called "close sync", i.e.
when a programme closes a file, any outstanding writes are flushed to disk
before the close returns to the programme. So when a programme ends you are
guaranteed that any state information is saved to the disk. (exit() also
results in close being called.)

open(xxx, O_DSYNC) is only good if you can alter the source code. Shell
scripts' use of awk, head, tail, echo etc. to create output files does not use
O_DSYNC; when the shell script returns 0, you want to know that all the data is
on the disk, so that if the system crashes the data is still there.

PS: it would be nice if UFS had closesync as well, instead of using
forcedirectio.

Cheers
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss