Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?

2013-02-20 Thread Mike Gerdts
On Wed, Feb 20, 2013 at 4:49 PM, Markus Grundmann mar...@freebsduser.eu wrote:
 Whenever I modify zfs pools or filesystems it's possible to destroy [on a
 bad day :-)] my data. A new
 property protected=on|off on the pool and/or filesystem could help protect the
 administrator against data loss
 (e.g. a zpool destroy tank or zfs destroy tank/filesystem command would
 be rejected
 when the protected=on property is set).

 Is there anywhere on this list where this feature
 request can be discussed or forwarded? I hope you
 understood my post ;-)

I like the idea and it is likely not very hard to implement.  This is
very similar to how snapshot holds work.

# zpool upgrade -v | grep -i hold
 18  Snapshot user holds

So long as you aren't using a really ancient zpool version, you could
use this feature to protect your file systems.

# zfs create a/b
# zfs snapshot a/b@snap
# zfs hold protectme a/b@snap
# zfs destroy a/b
cannot destroy 'a/b': filesystem has children
use '-r' to destroy the following datasets:
a/b@snap
# zfs destroy -r a/b
cannot destroy 'a/b@snap': snapshot is busy

Of course, snapshots aren't free if you write to the file system.  A
way around that is to create an empty file system within the one that
you are trying to protect.

# zfs create a/1
# zfs create a/1/hold
# zfs snapshot a/1/hold@hold
# zfs hold 'saveme!' a/1/hold@hold
# zfs holds a/1/hold@hold
NAME   TAG  TIMESTAMP
a/1/hold@hold  saveme!  Wed Feb 20 15:06:29 2013
# zfs destroy -r a/1
cannot destroy 'a/1/hold@hold': snapshot is busy

Extending the hold mechanism to filesystems and volumes would be quite nice.
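For illustration, if the requested property existed it might behave along
these lines (the protected property and the error text are hypothetical,
not an existing feature):

# zfs set protected=on tank/filesystem
# zfs destroy tank/filesystem
cannot destroy 'tank/filesystem': dataset is protected
# zfs set protected=off tank/filesystem
# zfs destroy tank/filesystem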

Mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor small-block random write performance

2012-07-19 Thread Traffanstead, Mike
vfs.zfs.txg.synctime_ms: 1000
vfs.zfs.txg.timeout: 5
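
(For reference, these FreeBSD tunables can be read with sysctl(8); a sketch:

# sysctl vfs.zfs.txg.synctime_ms vfs.zfs.txg.timeout

If a given tunable is read-only at runtime, it can instead be set at boot
from /boot/loader.conf, e.g. vfs.zfs.txg.timeout="10" -- the value here is
purely illustrative.)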

On Thu, Jul 19, 2012 at 8:47 PM, John Martin john.m.mar...@oracle.com wrote:
 On 07/19/12 19:27, Jim Klimov wrote:

 However, if the test file was written in 128K blocks and then
 is rewritten with 64K blocks, then Bob's answer is probably
 valid - the block would have to be re-read once for the first
 rewrite of its half; it might be taken from cache for the
 second half's rewrite (if that comes soon enough), and may be
 spooled to disk as a couple of 64K blocks or one 128K block
 (if both changes come soon after each other - within one TXG).


 What are the values for zfs_txg_synctime_ms and zfs_txg_timeout
 on this system (FreeBSD, IIRC)?


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Benefits of enabling compression in ZFS for the zones

2012-07-10 Thread Mike Gerdts
On Tue, Jul 10, 2012 at 6:29 AM, Jordi Espasa Clofent
jespa...@minibofh.org wrote:
 Thanks for your explanation, Fajar. However, take a look at the next lines:

 # available ZFS in the system

 root@sct-caszonesrv-07:~# zfs list

 NAME                         USED  AVAIL  REFER  MOUNTPOINT
 opt                          532M  34.7G   290M  /opt
 opt/zones                    243M  34.7G    32K  /opt/zones
 opt/zones/sct-scw02-shared   243M  34.7G   243M  /opt/zones/sct-scw02-shared
 static                       104K  58.6G    34K  /var/www/

 # creating a file in /root (UFS)

 root@sct-caszonesrv-07:~# dd if=/dev/zero of=file.bin count=1024 bs=1024
 1024+0 records in
 1024+0 records out
 1048576 bytes (1.0 MB) copied, 0.0545957 s, 19.2 MB/s
 root@sct-caszonesrv-07:~# pwd
 /root

 # enable compression in some ZFS zone

 root@sct-caszonesrv-07:~# zfs set compression=on opt/zones/sct-scw02-shared

 # copying the previous file to this zone

 root@sct-caszonesrv-07:~# cp /root/file.bin
 /opt/zones/sct-scw02-shared/root/

 # checking the file size in the origin dir (UFS) and the destination one
 (ZFS with compression enabled)

 root@sct-caszonesrv-07:~# ls -lh /root/file.bin
 -rw-r--r-- 1 root root 1.0M Jul 10 13:21 /root/file.bin

 root@sct-caszonesrv-07:~# ls -lh /opt/zones/sct-scw02-shared/root/file.bin
 -rw-r--r-- 1 root root 1.0M Jul 10 13:22
 /opt/zones/sct-scw02-shared/root/file.bin

 # both files have exactly the same cksum!

 root@sct-caszonesrv-07:~# cksum /root/file.bin
 3018728591 1048576 /root/file.bin

 root@sct-caszonesrv-07:~# cksum /opt/zones/sct-scw02-shared/root/file.bin
 3018728591 1048576 /opt/zones/sct-scw02-shared/root/file.bin

 So... I don't see any size variation with this test.

ls(1) tells you how much data is in the file - that is, how many bytes
of data an application will see if it reads the whole file.
du(1) tells you how many disk blocks are used.  If you look at the
stat structure in stat(2), ls reports st_size and du reports st_blocks.

Blocks full of zeros are special to zfs compression - it recognizes
them and stores no data.  Thus, a file that contains only zeros will
only require enough space to hold the file metadata.

$ zfs list -o compression ./
COMPRESS
  on

$ dd if=/dev/zero of=1gig count=1024 bs=1024k
1024+0 records in
1024+0 records out

$ ls -l 1gig
-rw-r--r--   1 mgerdts  staff1073741824 Jul 10 07:52 1gig

$ du -k 1gig
0   1gig
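
To see compression actually shrink on-disk usage, write compressible non-zero
data and again compare ls(1) with du(1); a quick sketch (file name and repeat
count are illustrative):

$ perl -e 'print "this line compresses well\n" x 100000' > compressible
$ ls -l compressible          # logical size (st_size)
$ du -k compressible          # on-disk blocks; far smaller with compression=on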

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Mike Gerdts
On Tue, Jun 12, 2012 at 11:17 AM, Sašo Kiselkov skiselkov...@gmail.com wrote:
 On 06/12/2012 05:58 PM, Andy Bowers - Performance Engineering wrote:
 find where your nics are bound too

 mdb -k
 ::interrupts

 create a processor set including those cpus [ so just the nic code will
 run there ]

 andy

 Tried and didn't help, unfortunately. I'm still seeing drops. What's
 even funnier is that I'm seeing drops when the machine is sync'ing the
 txg to the zpool. So looking at a little UDP receiver I can see the
 following input stream bandwidth (the stream is constant bitrate, so
 this shouldn't happen):

If processing in interrupt context (use intrstat) is dominating cpu
usage, you may be able to use pcitool to cause the device generating
all of those expensive interrupts to be moved to another CPU.
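
A minimal sketch of the kind of commands involved (interval and CPU ids are
illustrative; pcitool(1M) syntax varies by platform, so check its man page):

# intrstat 5                  # time each CPU spends servicing each device's interrupts
# echo ::interrupts | mdb -k  # which CPU each interrupt vector is bound to
# psrset -c 2 3               # carve CPUs 2 and 3 into a processor set, per Andy's suggestion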

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Strange hang during snapshot receive

2012-05-10 Thread Mike Gerdts
On Thu, May 10, 2012 at 5:37 AM, Ian Collins i...@ianshome.com wrote:
 I have an application I have been using to manage data replication for a
 number of years.  Recently we started using a new machine as a staging
 server (not that new, an x4540) running Solaris 11 with a single pool built
 from 7x6 drive raidz.  No dedup and no reported errors.

 On that box and nowhere else I see empty snapshots taking 17 or 18 seconds
 to write.  Everywhere else they return in under a second.

 Using truss and the last published source code, it looks like the pause is
 between a printf and  the call to zfs_ioctl and there aren't any other
 functions calls between them:

For each snapshot in a stream, there is one zfs_ioctl() call.  During
that time, the kernel will read the entire substream (that is, for one
snapshot) from the input file descriptor.


 100.5124     0.0004    open(/dev/zfs, O_RDWR|O_EXCL)            = 10
 100.7582     0.0001    read(7, \0\0\0\0\0\0\0\0ACCBBAF5.., 312)    = 312
 100.7586     0.    read(7, 0x080464F8, 0)                = 0
 100.7591     0.    time()                        = 1336628656
 100.7653     0.0035    ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040CF0)    = 0
 100.7699     0.0022    ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040900)    = 0
 100.7740     0.0016    ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040580)    = 0
 100.7787     0.0026    ioctl(8, ZFS_IOC_OBJSET_STATS, 0x080405B0)    = 0
 100.7794     0.0001    write(1,  r e c e i v i n g   i n.., 75)    = 75
 118.3551     0.6927    ioctl(8, ZFS_IOC_RECV, 0x08042570)        = 0
 118.3596     0.0010    ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040900)    = 0
 118.3598     0.    time()                        = 1336628673
 118.3600     0.    write(1,  r e c e i v e d   3 1 2.., 45)    = 45

 zpool iostat (1 second interval) for the period is:

 tank        12.5T  6.58T    175      0   271K      0
 tank        12.5T  6.58T    176      0   299K      0
 tank        12.5T  6.58T    189      0   259K      0
 tank        12.5T  6.58T    156      0   231K      0
 tank        12.5T  6.58T    170      0   243K      0
 tank        12.5T  6.58T    252      0   295K      0
 tank        12.5T  6.58T    179      0   200K      0
 tank        12.5T  6.58T    214      0   258K      0
 tank        12.5T  6.58T    165      0   210K      0
 tank        12.5T  6.58T    154      0   178K      0
 tank        12.5T  6.58T    186      0   221K      0
 tank        12.5T  6.58T    184      0   215K      0
 tank        12.5T  6.58T    218      0   248K      0
 tank        12.5T  6.58T    175      0   228K      0
 tank        12.5T  6.58T    146      0   194K      0
 tank        12.5T  6.58T     99    258   209K  1.50M
 tank        12.5T  6.58T    196    296   294K  1.31M
 tank        12.5T  6.58T    188    130   229K   776K

 Can anyone offer any insight or further debugging tips?

I have yet to see a time when zpool iostat tells me something useful.
I'd take a look at iostat -xzn 1 or similar output.  It could point
to imbalanced I/O or a particular disk that has abnormally high
service times.
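
Roughly what to look for (standard iostat -xn column layout; the numbers and
device name here are made up):

$ iostat -xzn 1
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  180.0    0.0  250.0    0.0  0.0  2.0    0.0   11.1   0  95 c0t3d0

A single disk with asvc_t or %b far above its peers is the usual suspect.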

Have you installed any SRUs?  If not, you could be seeing:

7060894 zfs recv is excruciatingly slow

which is fixed in Solaris 11 SRU 5.

If you are using zones and are using any https pkg(5) origins (such as
https://pkg.oracle.com/solaris/support), I suggest reading
https://forums.oracle.com/forums/thread.jspa?threadID=2380689&tstart=15
before updating to SRU 6 (SRU 5 is fine, however).  The fix for the
problem mentioned in that forums thread should show up in an upcoming
SRU via CR 7157313.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] test for holes in a file?

2012-03-26 Thread Mike Gerdts
2012/3/26 ольга крыжановская olga.kryzhanov...@gmail.com:
 How can I test if a file on ZFS has holes, i.e. is a sparse file,
 using the C api?

See SEEK_HOLE in lseek(2).
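
For example, a minimal C sketch of the idea (assuming the platform defines
SEEK_HOLE; a file with no holes reports its first "hole" at end-of-file):

#include <sys/types.h>
#include <unistd.h>

/* returns 1 if the open file has at least one hole, 0 otherwise */
int
has_hole(int fd)
{
        off_t end  = lseek(fd, 0, SEEK_END);
        off_t hole = lseek(fd, 0, SEEK_HOLE);

        /* with no holes, SEEK_HOLE lands on the implicit hole at EOF */
        return (hole >= 0 && hole < end);
}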

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] test for holes in a file?

2012-03-26 Thread Mike Gerdts
On Mon, Mar 26, 2012 at 6:18 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Mon, 26 Mar 2012, Andrew Gabriel wrote:

 I just played and knocked this up (note the stunning lack of comments,
 missing optarg processing, etc)...
 Give it a list of files to check...


 This is a cool program, but programmers were asking (and answering) this
 same question 20+ years ago before there was anything like SEEK_HOLE.

 If file space usage is less than file directory size then it must contain a
 hole.  Even for compressed files, I am pretty sure that Solaris reports the
 uncompressed space usage.

That's not the case.

# zfs create -o compression=on rpool/junk
# perl -e 'print "foo" x 10' > /rpool/junk/foo
# ls -ld /rpool/junk/foo
-rw-r--r--   1 root root  30 Mar 26 18:25 /rpool/junk/foo
# du -h /rpool/junk/foo
  16K   /rpool/junk/foo
# truss -t stat -v stat du  /rpool/junk/foo
...
lstat64(foo, 0x08047C40)  = 0
d=0x02B90028 i=8 m=0100644 l=1  u=0 g=0 sz=30
at = Mar 26 18:25:25 CDT 2012  [ 1332804325.742827733 ]
mt = Mar 26 18:25:25 CDT 2012  [ 1332804325.889143166 ]
ct = Mar 26 18:25:25 CDT 2012  [ 1332804325.889143166 ]
bsz=131072 blks=32    fs=zfs

Notice that it says it has 32 512-byte blocks.

The mechanism you suggest does work for every other file system that
I've tried it on.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Any rhyme or reason to disk dev names?

2011-12-21 Thread Mike Gerdts
On Wed, Dec 21, 2011 at 1:58 AM, Matthew R. Wilson
mwil...@mattwilson.org wrote:
 Hello,

 I am curious to know if there is an easy way to guess or identify the device
 names of disks. Previously the /dev/dsk/c0t0d0s0 system made sense to me...
 I had a SATA controller card with 8 ports, and they showed up with the
 numbers 1-8 in the t position of the device name.

 But I just built a new system with two LSI SAS HBAs in it, and my device
 names are along the lines of:
 /dev/dsk/c0t5000CCA228C0E488d0

 I could not find any correlation between that identifier and the a)
 controller the disk was plugged in to, or b) the port number on the
 controller. The only way I could make a mapping of device name to controller
 port was to add one drive at a time, reboot the system, and run format to
 see which new disk name shows up.

 I'm guessing there's a better way, but I can't find any obvious answer as to
 how to determine which port on my LSI controller card will correspond with
 which seemingly random device name. Can anyone offer any suggestions on a
 way to predict the device naming, or at least get the system to list the
 disks after I insert one without rebooting?

Depending on the hardware you are using, you may be able to benefit
from croinfo.

$ croinfo
D:devchassis-path  t:occupant-type  c:occupant-compdev
-  ---  -
/dev/chassis//SYS/SASBP/HDD0/disk  disk c0t5000CCA012B66E90d0
/dev/chassis//SYS/SASBP/HDD1/disk  disk c0t5000CCA012B68AC8d0

The text in the left column represents text that should be printed on
the corresponding disk slots.
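
Where croinfo is not available, a couple of generic commands can help
enumerate a newly inserted disk without a reboot (a sketch; behavior depends
on the HBA driver):

# devfsadm -v        # create device nodes for newly attached disks
# cfgadm -al         # list attachment points and their occupant state
# format </dev/null  # quick list of the disks the system currently sees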

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] gaining access to var from a live cd

2011-11-29 Thread Mike Gerdts
On Tue, Nov 29, 2011 at 3:01 PM, Francois Dion francois.d...@gmail.com wrote:
 I've hit an interesting (not) problem. I need to remove a problematic
 ld.config file (due to an improper crle...) to boot my laptop. This is
 OI 151a, but fundamentally this is zfs, so i'm asking here.

 what I did after booting the live cd and su:
 mkdir /tmp/disk
 zpool import -R /tmp/disk -f rpool

 export shows up in there and rpool also, but in rpool there is only
 boot and etc.

 zfs list shows rpool/ROOT/openindiana as mounted on /tmp/disk and I
 see dump and swap, but no var. rpool/ROOT shows as legacy, so I
 figured, maybe mount that.

 mount -F zfs rpool/ROOT /mnt/rpool

That dataset (rpool/ROOT) should never have any files in it.  It is
just a container for boot environments.  You can see which boot
environments exist with:

zfs list -r rpool/ROOT

If you are running Solaris 11, the boot environment's root dataset
will show a mountpoint property value of /.  Assuming it is called
solaris you can mount it with:

zfs mount -o mountpoint=/mnt/rpool rpool/ROOT/solaris

If the system is running Solaris 11 (and was not updated from Solaris
11 Express), it will have a separate /var dataset.

zfs mount -o mountpoint=/mnt/rpool/var rpool/ROOT/solaris/var

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] gaining access to var from a live cd

2011-11-29 Thread Mike Gerdts
On Tue, Nov 29, 2011 at 4:40 PM, Francois Dion francois.d...@gmail.com wrote:
 It is on openindiana 151a, with no separate /var as far as I can tell. But
 I'll have to test this on solaris11 too when I get a chance.

 The problem is that if I

 zfs mount -o mountpoint=/tmp/rescue (or whatever) rpool/ROOT/openindiana

 I get a cannot mount '/mnt/rpool': directory is not empty error.

 The reason for that is that I had to do a zpool import -R /mnt/rpool
 rpool (or wherever I mount it, it doesn't matter) before I could do a
 zfs mount, else I don't have access to the rpool zpool for zfs to do
 its thing.

 chicken / egg situation? I miss the old fail safe boot menu...

You can mount it pretty much anywhere:

mkdir /tmp/foo
zfs mount -o mountpoint=/tmp/foo ...

I'm not sure when the temporary mountpoint option (-o mountpoint=...)
came in. If it's not valid syntax then:

mount -F zfs rpool/ROOT/solaris /tmp/foo

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FS Reliability WAS: about btrfs and zfs

2011-10-21 Thread Mike Gerdts
On Fri, Oct 21, 2011 at 8:02 PM, Fred Liu fred_...@issi.com wrote:

 3. Do NOT let a system see drives with more than one OS zpool at the
 same time (I know you _can_ do this safely, but I have seen too many
 horror stories on this list that I just avoid it).


 Can you elaborate #3? In what situation will it happen?

Some people have trained their fingers to use the -f option on every
command that supports it to force the operation.  For instance, how
often do you do rm -rf vs. rm -r and answer questions about every
file?

If various zpool commands (import, create, replace, etc.) are used
against the wrong disk with a force option, you can clobber a zpool
that is in active use by another system.  In a previous job, my lab
environment had a bunch of LUNs presented to multiple boxes.  This was
done for convenience in an environment where there would be little
impact if an errant command were issued.  I'd never do that in
production without some form of I/O fencing in place.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Kernel panic on zpool import. 200G of data inaccessible!

2011-08-05 Thread Mike Gerdts
On Thu, Aug 4, 2011 at 2:47 PM, Stuart James Whitefish
swhitef...@yahoo.com wrote:
 # zpool import -f tank

 http://imageshack.us/photo/my-images/13/zfsimportfail.jpg/

I encourage you to open a support case and ask for an escalation on CR 7056738.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs rename query

2011-07-27 Thread Mike Gerdts
On Wed, Jul 27, 2011 at 6:37 AM, Nishchaya Bahuguna
nishchaya.bahug...@oracle.com wrote:
 Hi,

 I have a query regarding the zfs rename command.

 There are 5 zones and my requirement is to change the zone paths using zfs
 rename.

 + zoneadm list -cv
 ID NAME         STATUS     PATH                  BRAND    IP
  0 global       running    /                     native   shared
 34 public       running    /txzone/public        native   shared
 35 internal     running    /txzone/internal      native   shared
 36 restricted   running    /txzone/restricted    native   shared
 37 needtoknow   running    /txzone/needtoknow    native   shared
 38 sandbox      running    /txzone/sandbox       native   shared

 A whole root zone public was configured and installed. Rest of the 4 zones
 were cloned from public.

 zoneadm -z zoneName clone public

 zfs get origin lists the origin as public for all 4 zones.

 I run zfs rename on 4 of these cloned zones and it throws a device busy
 error because of the parent-child relationship.

I think you are getting the device busy error for a different reason.
I just did the following:

zfs create -o mountpoint=/zones rpool/zones
zonecfg -z z1 'create; set zonepath=/zones/z1'
zoneadm -z z1 install
zonecfg -z z1c1 'create -t z1; set zonepath=/zones/z1c1'
zonecfg -z z1c2 'create -t z1; set zonepath=/zones/z1c2'
zoneadm -z z1c1 clone z1
zoneadm -z z1c2 clone z1

At this point, I have the following:

bash-3.2# zfs list -r -o name,origin rpool/zones
NAME                      ORIGIN
rpool/zones               -
rpool/zones/z1            -
rpool/zones/z1@SUNWzone1  -
rpool/zones/z1@SUNWzone2  -
rpool/zones/z1c1          rpool/zones/z1@SUNWzone1
rpool/zones/z1c2          rpool/zones/z1@SUNWzone2

Next, I decide that I would like z1c1 to be rpool/new/z1c1 instead of
its current place.  Note that this will also change the mountpoint
which breaks the zone.

bash-3.2# zfs create -o mountpoint=/new rpool/new
bash-3.2# zfs rename rpool/zones/z1c1 rpool/new/z1c1
bash-3.2# zfs list -o name,origin -r /new
NAME            ORIGIN
rpool/new       -
rpool/new/z1c1  rpool/zones/z1@SUNWzone1

To get a device busy error, I need to cause a situation where the
zonepath cannot be unmounted.  Having the zone running is a good way
to do that:

bash-3.2# zoneadm -z z1c2 boot
WARNING: zone z1c1 is installed, but its zonepath /zones/z1c1 does not exist.
bash-3.2# zfs rename rpool/zones/z1c2 rpool/new/z1c2
cannot unmount '/zones/z1c2': Device busy

 I guess that can be handled with zfs promote because promote would swap the
 parent and child.

You would need to do this to rename a dataset that is the origin (one
that is cloned), not the clones.  That is, if you wanted to rename the
dataset for your public zone or I wanted to rename the dataset for z1,
then you would need to promote the datasets for all of the clones.
This is a known issue.

6472202 'zfs rollback' and 'zfs rename' require that clones be unmounted

 So, how do I make it work when there are multiple zones cloned from a single
 parent? Is there a way that zfs rename can work for ALL the zones rather
 than working with two zones at a time?

As I said above.


 Also, is there a command line option available for sorting the datasets in
 correct dependency order?

zfs list -r -o name,origin is a good starting point.  I suspect that
it doesn't give you exactly the output you are looking for.
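
One hedged way to get a dependency-oriented view from the shell (dataset
names as in the example above) is to sort on the origin column, so that each
group of clones lands next to the snapshot it derives from:

zfs list -H -o origin,name -r rpool/zones rpool/new | sort

Datasets that are not clones show an origin of "-" and sort to the top.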

FWIW, the best way to achieve what you are after without breaking the
zones is going to be along the lines of:

zlogin z1c1 init 0
zoneadm -z z1c1 detach
zfs rename rpool/zones/z1c1 rpool/new/z1c1
zonecfg -z z1c1 'set zonepath=/new/z1c1'
zoneadm -z z1c1 attach
zoneadm -z z1c1 boot

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is .$EXTEND/$QUOTA ?

2011-07-19 Thread Mike Gerdts
On Tue, Jul 19, 2011 at 2:39 PM, Orvar Korvar
knatte_fnatte_tja...@yahoo.com wrote:
 I am using S11E, and have created a zpool on a single disk as storage. In 
 several directories, I can see a directory called  .$EXTEND/$QUOTA. What is 
 it for? Can I delete it?
 --

Perhaps this is of help.

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/smbsrv/smb_pathname.c#752

752 /*
753  * smb_pathname_preprocess_quota
754  *
755  * There is a special file required by windows so that the quota
756  * tab will be displayed by windows clients. This is created in
757  * a special directory, $EXTEND, at the root of the shared file
758  * system. To hide this directory prepend a '.' (dot).
759  */

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)

2011-07-15 Thread Mike Gerdts
,
 where adding a good enterprise SSD would double the
 server cost - not only on those big good systems with
 tens of GB of RAM), and hopefully simplifying the system
 configuration and maintenance - that is indeed the point
 in question.

 //Jim





-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Non-Global zone recovery

2011-07-07 Thread Mike Gerdts
On Thu, Jul 7, 2011 at 2:41 PM, Ram kumar ram.kum...@gmail.com wrote:

 Hi Cindy,

 Thanks for the email.

 We are using Solaris 10 without Live Upgrade.

 Tested following in the sandbox environment:

 1)  We have one non-global zone (TestZone)  which is running on Test 
 zpool (SAN)

 2)  Don’t see zpool or non-global zone after re-image of Global zone.

 3)  Imported zpool Test

 Now I am trying to create Non-global zone and it is giving error

 bash-3.00# zonecfg -z Test
 Test: No such zone configured
 Use 'create' to begin configuring a new zone.
 zonecfg:Test> create -a /zones/Test
 invalid path to detached zone

If you use create -a, it requires that SUNWdetached.xml exist as a
means for configuring the various properties (e.g. zonepath, brand,
etc.) and resources (inherit-pkg-dir, net, fs, device, etc.) for the
zone.  Since you don't have the SUNWdetached.xml, you can't use it.

Assuming you have a backup of the system, you could restore a copy of
/etc/zones/zonename.xml to /etc/zones/restored-zonename.xml, then
run:

zonecfg -z zonename create -t restored-zonename

If that's not an option or is just too inconvenient, use zonecfg to
configure the zone just like you did initially.  That is, do not use
create -a, use create, create -b, or create -t
whateverTemplateYouUsed followed by whatever property settings and
added resources are appropriate.

After you get past zonecfg, you should be able to:

zoneadm -z zonename attach

If the package and patch levels don't match up (the global zone
perhaps was installed from a newer update or has newer patches):

zoneadm -z zonename attach -U
or
zoneadm -z zonename attach -u

Since you seem to be doing this in a test environment to prepare for
bad things to happen, I'd suggest that you make it a standard practice
when you are done configuring a zone to do:

zonecfg -z zonename export > zonepath/zonecfg.export

Then if you need to recover the zone using only the things that are on
the SAN, you can do:

zpool import ...
zonecfg -z zonename -f zonepath/zonecfg.export
zoneadm -z zonename attach [-u|-U]

Any follow-ups should probably go to Oracle Support or zones-discuss.
Your problems are not related to zfs.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FW: Solaris panic

2011-03-17 Thread Mike Gerdts





-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs-nfs-sun 7000 series

2011-03-10 Thread Mike MacNeil
Hello,
I have a Sun 7000 series NAS device, I am trying to back it up via NFS mount on 
a Solaris 10 server running Networker 7.6.1.  It works but it is extremely 
slow, I have tested other mounts and they work much faster.  The only 
difference (that I can see) between the two mounts are the underlying file 
system zfs vs ufs.  Any thoughts to speed up the backup of the Sun 7000 nfs 
mount?
Thank you.


Mike MacNeil
Global IT Infrastructure


4281 Harvester Rd.
Burlington, ON l7l 5m4
Canada

Phone: 905 632 2999 ext.2920
Fax: 905 632 2055
Email: mike.macn...@gennum.com
www.gennum.com



This communication contains confidential information intended only for the 
addressee(s). If you have received this communication in error, please notify 
us immediately and delete this communication from your mail box.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] External SATA drive enclosures + ZFS?

2011-02-25 Thread Mike Tancsa
On 2/25/2011 7:34 PM, Rich Teer wrote:
 
 One product that seems to fit the bill is the StarTech.com S352U2RER,
 an external dual SATA disk enclosure with USB and eSATA connectivity
 (I'd be using the USB port).  Here's a link to the specific product
 I'm considering:
 
 http://ca.startech.com/product/S352U2RER-35in-eSATA-USB-Dual-SATA-Hot-Swap-External-RAID-Hard-Drive-Enclosure

I have had mixed results with their 4 bay version.  When they work, they
are great, but we have had a number of DOA/almost DOA units.  I have had
good luck with products from
http://www.addonics.com/
(They ship to Canada as well without issue)

Why use USB? You will get much better performance/throughput on eSata
(if you have good drivers of course). I use their sil3124 eSata
controller on FreeBSD as well as a number of PM units and they work great.

---Mike


-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multiple disk failure (solved?)

2011-02-01 Thread Mike Tancsa
On 1/31/2011 4:19 PM, Mike Tancsa wrote:
 On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
 Hi Mike,

 Yes, this is looking much better.

 Some combination of removing corrupted files indicated in the zpool
 status -v output, running zpool scrub and then zpool clear should
 resolve the corruption, but its depends on how bad the corruption is.

 First, I would try least destruction method: Try to remove the
 files listed below by using the rm command.

 This entry probably means that the metadata is corrupted or some
 other file (like a temp file) no longer exists:

 tank1/argus-data:0xc6
 
 
 Hi Cindy,
   I removed the files that were listed, and now I am left with
 
 errors: Permanent errors have been detected in the following files:
 
 tank1/argus-data:0xc5
 tank1/argus-data:0xc6
 tank1/argus-data:0xc7
 
 I have started a scrub
  scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go


Looks like that was it!  The scrub finished in the time it estimated and
that was all I needed to do. I did not have to do zpool clear or any
other commands.  Is there anything beyond scrub to check the integrity
of the pool ?

0(offsite)# zpool status -v
  pool: tank1
 state: ONLINE
 scrub: scrub completed after 7h32m with 0 errors on Mon Jan 31 23:00:46
2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors
0(offsite)#


---Mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multiple disk failure (solved?)

2011-01-31 Thread Mike Tancsa
On 1/29/2011 6:18 PM, Richard Elling wrote:
 
 On Jan 29, 2011, at 12:58 PM, Mike Tancsa wrote:
 
 On 1/29/2011 12:57 PM, Richard Elling wrote:
 0(offsite)# zpool status
 pool: tank1
 state: UNAVAIL
 status: One or more devices could not be opened.  There are insufficient
   replicas for the pool to continue functioning.
 action: Attach the missing device and online it using 'zpool online'.
  see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
 config:

     NAME        STATE     READ WRITE CKSUM
     tank1       UNAVAIL      0     0     0  insufficient replicas
       raidz1    ONLINE       0     0     0
         ad0     ONLINE       0     0     0
         ad1     ONLINE       0     0     0
         ad4     ONLINE       0     0     0
         ad6     ONLINE       0     0     0
       raidz1    ONLINE       0     0     0
         ada4    ONLINE       0     0     0
         ada5    ONLINE       0     0     0
         ada6    ONLINE       0     0     0
         ada7    ONLINE       0     0     0
       raidz1    UNAVAIL      0     0     0  insufficient replicas
         ada0    UNAVAIL      0     0     0  cannot open
         ada1    UNAVAIL      0     0     0  cannot open
         ada2    UNAVAIL      0     0     0  cannot open
         ada3    UNAVAIL      0     0     0  cannot open
 0(offsite)#

 This is usually easily solved without data loss by making the
 disks available again.  Can you read anything from the disks using
 any program?

 Thats the strange thing, the disks are readable.  The drive cage just
 reset a couple of times prior to the crash. But they seem OK now.  Same
 order as well.

 # camcontrol devlist
 WDC WD\021501FASR\25500W2B0 \200 0956  at scbus0 target 0 lun 0
 (pass0,ada0)
 WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 1 lun 0
 (pass1,ada1)
 WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 2 lun 0
 (pass2,ada2)
 WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 3 lun 0
 (pass3,ada3)


 # dd if=/dev/ada2 of=/dev/null count=20 bs=1024
 20+0 records in
 20+0 records out
 20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
 0(offsite)#
 
 The next step is to run zdb -l and look for all 4 labels. Something like:
   zdb -l /dev/ada2
 
 If all 4 labels exist for each drive and appear intact, then look more closely
 at how the OS locates the vdevs. If you can't solve the UNAVAIL problem,
 you won't be able to import the pool.
  -- richard

On 1/29/2011 10:13 PM, James R. Van Artsdalen wrote:
 On 1/28/2011 4:46 PM, Mike Tancsa wrote:

 I had just added another set of disks to my zfs array. It looks like the
 drive cage with the new drives is faulty.  I had added a couple of files
 to the main pool, but not much.  Is there any way to restore the pool
 below ? I have a lot of files on ad0,1,4,6 and ada4,5,6,7 and perhaps
 one file on the new drives in the bad cage.

 Get another enclosure and verify it works OK.  Then move the disks from
 the suspect enclosure to the tested enclosure and try to import the pool.

 The problem may be cabling or the controller instead - you didn't
 specify how the disks were attached or which version of FreeBSD you're
 using.


First off thanks to all who responded on and offlist!

Good news (for me) it seems. New cage and all seems to be recognized
correctly.  The history is

...
2010-04-22.14:27:38 zpool add tank1 raidz /dev/ada4 /dev/ada5 /dev/ada6
/dev/ada7
2010-06-11.13:49:33 zfs create tank1/argus-data
2010-06-11.13:49:41 zfs create tank1/argus-data/previous
2010-06-11.13:50:38 zfs set compression=off tank1/argus-data
2010-08-06.12:20:59 zpool replace tank1 ad1 ad1
2010-09-16.10:17:51 zpool upgrade -a
2011-01-28.11:45:43 zpool add tank1 raidz /dev/ada0 /dev/ada1 /dev/ada2
/dev/ada3

FreeBSD RELENG_8 from last week, 8G of RAM, amd64.

 zpool status -v
  pool: tank1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0

Re: [zfs-discuss] multiple disk failure (solved?)

2011-01-31 Thread Mike Tancsa
On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
 Hi Mike,
 
 Yes, this is looking much better.
 
 Some combination of removing corrupted files indicated in the zpool
 status -v output, running zpool scrub and then zpool clear should
 resolve the corruption, but its depends on how bad the corruption is.
 
 First, I would try least destruction method: Try to remove the
 files listed below by using the rm command.
 
 This entry probably means that the metadata is corrupted or some
 other file (like a temp file) no longer exists:
 
 tank1/argus-data:0xc6


Hi Cindy,
I removed the files that were listed, and now I am left with

errors: Permanent errors have been detected in the following files:

tank1/argus-data:0xc5
tank1/argus-data:0xc6
tank1/argus-data:0xc7

I have started a scrub
 scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go

I will report back once the scrub is done!

---Mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multiple disk failure

2011-01-30 Thread Mike Tancsa
On 1/30/2011 12:39 AM, Richard Elling wrote:
 Hmmm, doesnt look good on any of the drives.
 
 I'm not sure of the way BSD enumerates devices.  Some clever person thought
 that hiding the partition or slice would be useful. I don't find it useful.  
 On a Solaris
 system, ZFS can show a disk something like c0t1d0, but that doesn't exist. The
 actual data is in slice 0, so you need to use c0t1d0s0 as the argument to zdb.

I think it's the right syntax.  On the older drives,


0(offsite)# zdb -l /dev/ada0

LABEL 0

failed to unpack label 0

LABEL 1

failed to unpack label 1

LABEL 2

failed to unpack label 2

LABEL 3

failed to unpack label 3
0(offsite)# zdb -l /dev/ada4

LABEL 0

version=15
name='tank1'
state=0
txg=44593174
pool_guid=7336939736750289319
hostid=3221266864
hostname='offsite.sentex.ca'
top_guid=6980939370923808328
guid=16144392433229115618
vdev_tree
type='raidz'
id=1
guid=6980939370923808328
nparity=1
metaslab_array=38
metaslab_shift=35
ashift=9
asize=4000799784960
is_log=0
children[0]
type='disk'
id=0
guid=16144392433229115618
path='/dev/ada4'
whole_disk=0
DTL=341
children[1]
type='disk'
id=1
guid=1210677308003674848
path='/dev/ada5'
whole_disk=0
DTL=340
children[2]
type='disk'
id=2
guid=2517076601231706249
path='/dev/ada6'
whole_disk=0
DTL=339
children[3]
type='disk'
id=3
guid=16621760039941477713
path='/dev/ada7'
whole_disk=0
DTL=338

LABEL 1

version=15
name='tank1'
state=0
txg=44592523
pool_guid=7336939736750289319
hostid=3221266864
hostname='offsite.sentex.ca'
top_guid=6980939370923808328
guid=16144392433229115618
vdev_tree
type='raidz'
id=1
guid=6980939370923808328
nparity=1
metaslab_array=38
metaslab_shift=35
ashift=9
asize=4000799784960
is_log=0
children[0]
type='disk'
id=0
guid=16144392433229115618
path='/dev/ada4'
whole_disk=0
DTL=341
children[1]
type='disk'
id=1
guid=1210677308003674848
path='/dev/ada5'
whole_disk=0
DTL=340
children[2]
type='disk'
id=2
guid=2517076601231706249
path='/dev/ada6'
whole_disk=0
DTL=339
children[3]
type='disk'
id=3
guid=16621760039941477713
path='/dev/ada7'
whole_disk=0
DTL=338

LABEL 2

version=15
name='tank1'
state=0
txg=44593174
pool_guid=7336939736750289319
hostid=3221266864
hostname='offsite.sentex.ca'
top_guid=6980939370923808328
guid=16144392433229115618
vdev_tree
type='raidz'
id=1
guid=6980939370923808328
nparity=1
metaslab_array=38
metaslab_shift=35
ashift=9
asize=4000799784960
is_log=0
children[0]
type='disk'
id=0
guid=16144392433229115618
path='/dev/ada4'
whole_disk=0
DTL=341
children[1]
type='disk'
id=1
guid=1210677308003674848
path='/dev/ada5'
whole_disk=0
DTL=340
children[2]
type='disk'
id=2
guid=2517076601231706249
path='/dev/ada6'
whole_disk=0
DTL=339
children[3]
type='disk'
id=3
guid=16621760039941477713
path='/dev/ada7'
whole_disk=0
DTL=338

Re: [zfs-discuss] multiple disk failure

2011-01-29 Thread Mike Tancsa
On 1/29/2011 12:57 PM, Richard Elling wrote:
 0(offsite)# zpool status
  pool: tank1
 state: UNAVAIL
 status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
 action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
 config:

     NAME        STATE     READ WRITE CKSUM
     tank1       UNAVAIL      0     0     0  insufficient replicas
       raidz1    ONLINE       0     0     0
         ad0     ONLINE       0     0     0
         ad1     ONLINE       0     0     0
         ad4     ONLINE       0     0     0
         ad6     ONLINE       0     0     0
       raidz1    ONLINE       0     0     0
         ada4    ONLINE       0     0     0
         ada5    ONLINE       0     0     0
         ada6    ONLINE       0     0     0
         ada7    ONLINE       0     0     0
       raidz1    UNAVAIL      0     0     0  insufficient replicas
         ada0    UNAVAIL      0     0     0  cannot open
         ada1    UNAVAIL      0     0     0  cannot open
         ada2    UNAVAIL      0     0     0  cannot open
         ada3    UNAVAIL      0     0     0  cannot open
 0(offsite)#
 
 This is usually easily solved without data loss by making the
 disks available again.  Can you read anything from the disks using
 any program?

Thats the strange thing, the disks are readable.  The drive cage just
reset a couple of times prior to the crash. But they seem OK now.  Same
order as well.

# camcontrol devlist
WDC WD\021501FASR\25500W2B0 \200 0956  at scbus0 target 0 lun 0
(pass0,ada0)
WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 1 lun 0
(pass1,ada1)
WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 2 lun 0
(pass2,ada2)
WDC WD\021501FASR\25500W2B0 \200 05.01D\0205  at scbus0 target 3 lun 0
(pass3,ada3)


# dd if=/dev/ada2 of=/dev/null count=20 bs=1024
20+0 records in
20+0 records out
20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
0(offsite)#

---Mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multiple disk failure

2011-01-29 Thread Mike Tancsa
On 1/29/2011 11:38 AM, Edward Ned Harvey wrote:
 
 That is precisely the reason why you always want to spread your mirror/raidz
 devices across multiple controllers or chassis.  If you lose a controller or
 a whole chassis, you lose one device from each vdev, and you're able to
 continue production in a degraded state...


Thanks.  These are backups of backups. It would be nice to restore them
as it will take a while to sync up once again.  But if I need to start
fresh, is there a resource you can point me to with the current best
practices for laying out large storage like this?  It's just for backups
of backups in a DR site

---Mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multiple disk failure

2011-01-29 Thread Mike Tancsa
On 1/29/2011 6:18 PM, Richard Elling wrote:
 0(offsite)#
 
 The next step is to run zdb -l and look for all 4 labels. Something like:
   zdb -l /dev/ada2
 
 If all 4 labels exist for each drive and appear intact, then look more closely
 at how the OS locates the vdevs. If you can't solve the UNAVAIL problem,
 you won't be able to import the pool.



Hmmm, doesn't look good on any of the drives.  Before I give up, I will
try the drives in a different cage on Monday. Unfortunately, it's 150km
away from me at our DR site.


# zdb -l /dev/ada0

LABEL 0

failed to unpack label 0

LABEL 1

failed to unpack label 1

LABEL 2

failed to unpack label 2

LABEL 3

failed to unpack label 3
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] multiple disk failure

2011-01-28 Thread Mike Tancsa
Hi,
I am using FreeBSD 8.2 and went to add 4 new disks today to expand my
offsite storage.  All was working fine for about 20min and then the new
drive cage started to fail.  Silly me for assuming new hardware would be
fine :(

The new drive cage started to fail, it hung the server and the box
rebooted.  After it rebooted, the entire pool is gone and in the state
below.  I had only written a few files to the new larger pool and I am
not concerned about restoring that data.  However, is there a way to get
back the original pool data ?
Going to http://www.sun.com/msg/ZFS-8000-3C gives a 503 error on the web
page listed BTW.


0(offsite)# zpool status
  pool: tank1
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       UNAVAIL      0     0     0  insufficient replicas
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
          raidz1    UNAVAIL      0     0     0  insufficient replicas
            ada0    UNAVAIL      0     0     0  cannot open
            ada1    UNAVAIL      0     0     0  cannot open
            ada2    UNAVAIL      0     0     0  cannot open
            ada3    UNAVAIL      0     0     0  cannot open
0(offsite)#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool import crashes system

2010-11-11 Thread Mike DeMarco
I am trying to bring in my zpool from build 121 into build 134 and every time I 
do a zpool import the system crashes.
  I have read other posts for this and have tried setting zfs_recover = 1 and
aok = 1 in /etc/system. I have used mdb to verify that they are in the kernel,
but the system still crashes as soon as import is called.
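
For reference, the settings described above normally look like this in
/etc/system and can be read back from the running kernel with mdb (a sketch;
the values are the ones mentioned):

set zfs:zfs_recover = 1
set aok = 1

# after a reboot, verify they took effect:
echo "zfs_recover/D" | mdb -k
echo "aok/D" | mdb -k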
  On this system I can rebuild the entire pool from scratch but my next system 
is 4Tbytes and I don't have space on any other system to store that much data.
  Anyone have a way to import and upgrade an older pool to a newer OS?

TIA mic
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Moving the 17 zones from one LUN to another LUN

2010-10-27 Thread Mike Gerdts
On Wed, Oct 27, 2010 at 9:27 AM, bhanu prakash bhanu.sys...@gmail.com wrote:
 Hi Mike,


 Thanks for the information...

 Actually the requirement is like this. Please let me know whether it matches
 for the below requirement or not.

 Question:

 The SAN team will assign new LUNs on an EMC DMX4 (currently IBM Hitachi is
 there). We need to move the 17 containers which exist on the
 server Host1 to the new LUNs.


 Please give me the steps to do this activity.

Without knowing the layout of the storage, it is impossible to give
you precise instructions.  This sounds like it is a production Solaris
10 system in an enterprise environment.  In most places that I've
worked, I would be hesitant to provide the required level of detail on
a public mailing list.  Perhaps you should open a service call to get
the assistance you need.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Mike Gerdts
On Wed, Oct 27, 2010 at 3:41 PM, Harry Putnam rea...@newsguy.com wrote:
 I'm guessing it was probably more like 60 to 62 c under load.  The
 temperature I posted was after something like 5minutes of being
 totally shutdown and the case been open for a long while. (mnths if
 not yrs)

What happens if the case is closed (and all PCI slot, disk, etc. slots
are closed)?  Having the case open likely changes the way that air
flows across the various components.  Also, if there is tobacco smoke
near the machine, it will cause a sticky build-up that likely
contributes to heat dissipation problems.

Perhaps this belongs somewhere other than zfs-discuss - it has nothing
to do with zfs.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Moving the 17 zones from one LUN to another LUN

2010-10-26 Thread Mike Gerdts
On Tue, Oct 26, 2010 at 9:40 AM, bhanu prakash bhanu.sys...@gmail.com wrote:
 Hi Team,


 There are 17 zones on the machine (a T5120). I want to move all the zones,
 which are ZFS filesystems, to another new LUN.

 Can you give me the steps to proceed this.

If the only thing on the source lun is the pool that contains the
zones and the new LUN is at least as big as the old LUN:

zpool replace pool oldlun newlun

The above can be done while the zones are booted.  Depending on the
characteristics of the server and workloads, the workloads may feel a
bit sluggish during this time due to increased I/O activity.  If that
works for you, stop reading now.
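
While the replace runs, its progress can be watched with something like
(pool name as above):

zpool status pool      # shows the resilver in progress and an estimated completion time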

In the event that the scenario above doesn't apply, read on.  Assuming
all the zones are under oldpool/zones, oldpool/zones is mounted at
/zones, and you have done zpool create newpool newlun

Be sure to test this procedure - I didn't!

# no need to pre-create the target; the recursive receive creates newpool/zones

# optionally, shut down the zones
zfs snapshot -r oldpool/zones@phase1
zfs send -R oldpool/zones@phase1 | zfs receive newpool/zones

# If you did not shut down the zones above, shut them down now.
# If the zones were shut down, skip the next two commands
zfs snapshot -r oldpool/zones@phase2
zfs send -R -I @phase1 oldpool/zones@phase2 \
    | zfs receive newpool/zones

# Adjust mount points and restart the zones
zfs set mountpoint=none oldpool/zones
zfs set mountpoint=/zones newpool/zones
for zone in $zonelist ; do zoneadm -z $zone boot ; done

At such a time that you are comfortable that the zone data moved over ok...

zfs destroy -r oldpool/zones


Again, verify the procedure works on a test/lab/whatever box before
trying it for real.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] making sense of arcstat.pl output

2010-10-01 Thread Mike Harsch
For posterity, I'd like to point out the following:

neel's original arcstat.pl uses a crude scaling routine that results in a large 
loss of precision as numbers cross from Kilobytes to Megabytes to Gigabytes.  
The 1G reported arc size described here could actually be anywhere 
between 1.0GB and 2.0GB. Use 'kstat zfs::arcstats' to read the arc 
size directly from the kstats (for comparison). 
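
For example (kstat -p prints module:instance:name:statistic and the value;
arc size is reported in bytes, and the number shown here is illustrative):

$ kstat -p zfs:0:arcstats:size
zfs:0:arcstats:size     1610612736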

I've updated arcstat.pl with a better scaling routine that returns more 
appropriate results (similar to df -h human-readable output).  I've also added 
support for L2ARC stats.  The updated version can be found here:

http://github.com/mharsch/arcstat
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] making sense of arcstat.pl output

2010-10-01 Thread Mike Harsch
Hello Christian,

Thanks for bringing this to my attention.  I believe I've fixed the rounding 
error in the latest version.

http://github.com/mharsch/arcstat
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] making sense of arcstat.pl output

2010-10-01 Thread Mike Harsch
przemol,

Thanks for the feedback.  I had incorrectly assumed that any machine running 
the script would have L2ARC implemented (which is not the case with Solaris 
10).  I've added a check for this that allows the script to work on non-L2ARC 
machines as long as you don't specify L2ARC stats on the command line.

http://github.com/mharsch/arcstat
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file level clones

2010-09-27 Thread Mike Gerdts
On Mon, Sep 27, 2010 at 6:23 AM, Robert Milkowski mi...@task.gda.pl wrote:
snip
 Also see http://www.symantec.com/connect/virtualstoreserver

And 
http://blog.scottlowe.org/2008/12/03/2031-enhancements-to-netapp-cloning-technology/


-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))

2010-09-23 Thread Mike.


On 9/23/2010 at 12:38 PM Erik Trimble wrote:

| [snip]
|If you don't really care about ultra-low-power, then there's
absolutely 
|no excuse not to buy a USED server-class machine which is 1- or 2- 
|generations back.  They're dirt cheap, readily available, 
| [snip]
 =



Anyone have a link or two to a place where I can buy some dirt-cheap,
readily available last gen servers?



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] recordsize

2010-09-16 Thread Mike DeMarco
What are the ramifications of changing the recordsize of a zfs filesystem that 
already has data on it?

I want to tune down the recordsize, to speed up very small reads, to a size that 
is more in line with the read size. Can I do this on a filesystem that has 
data already on it, and how does it affect that data? The zpool consists of 8 
SAN LUNs.

Thanks
mike
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mac OS X clients with ZFS server

2010-09-16 Thread Mike Mackovitch
On Thu, Sep 16, 2010 at 08:15:53AM -0700, Rich Teer wrote:
 On Thu, 16 Sep 2010, Erik Ableson wrote:
 
  OpenSolaris snv129
 
 Hmm, SXCE snv_130 here.  Did you have to do any server-side tuning
 (e.g., allowing remote connections), or did it just work out of the
 box?  I know that Sendmail needs some gentle persuasion to accept
 remote connections out of the box; perhaps lockd is the same?

So, you've been having this problem since April.
Did you ever try getting packet traces to see where the problem is?

As I previously stated, if you want, you can forward the traces to me to
look at.  Let me know if you need the directions on how to capture them.

--macko
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mac OS X clients with ZFS server

2010-09-15 Thread Mike Mackovitch
On Wed, Sep 15, 2010 at 12:08:20PM -0700, Nabil wrote:
 any resolution to this issue?  I'm experiencing the same annoying
 lockd thing with mac osx 10.6 clients.  I am at pool ver 14, fs ver
 3.  Would somehow going back to the earlier 8/2 setup make things
 better?

As noted in the earlier thread, the annoying lockd thing is not a
ZFS issue, but rather a networking issue.

FWIW, I never saw a resolution.  But the suggestions for how to debug
situations like this still stand:

 So, it looks like you need to investigate why the client isn't
 getting responses from the server's lockd.

 This is usually caused by a firewall or NAT getting in the way.

 I would also check /var/log/system.log and /var/log/kernel.log on the Mac to
 see if any other useful messages are getting logged.

 Then I'd grab packet traces with wireshark/tcpdump/snoop *simultaneously* on
 the client and the server, reproduce the problem, and then determine which
 packets are being sent and which packets are being received.
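
A minimal sketch of capturing such traces (interface, file, and host names
are illustrative):

# on the Mac client:
sudo tcpdump -i en0 -s 0 -w client.pcap host nfs-server
# on the Solaris server:
snoop -d e1000g0 -o server.snoop mac-client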

HTH
--macko
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to migrate to 4KB sector drives?

2010-09-12 Thread Mike Gerdts
On Sun, Sep 12, 2010 at 5:42 PM, Richard Elling rich...@nexenta.com wrote:
 On Sep 12, 2010, at 10:11 AM, Brandon High wrote:

 On Sun, Sep 12, 2010 at 10:07 AM, Orvar Korvar
 knatte_fnatte_tja...@yahoo.com wrote:
 No replies. Does this mean that you should avoid large drives with 4KB 
 sectors, that is, new drives? ZFS does not handle new drives?

 Solaris 10u9 handles 4k sectors, so it might be in a post-b134 release of 
 osol.

 OSol source yes, binaries no :-(  You will need another distro besides 
 OpenSolaris.
 The needed support in sd was added around the b137 timeframe.

OpenIndiana, to be released on Tuesday, is based on b146 or later.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VM's on ZFS - 7210

2010-08-28 Thread Mike Gerdts
On Sat, Aug 28, 2010 at 8:19 AM, Ray Van Dolson rvandol...@esri.com wrote:
 On Sat, Aug 28, 2010 at 05:50:38AM -0700, Eff Norwood wrote:
 I can't think of an easy way to measure pages that have not been consumed 
 since it's really an SSD controller function which is obfuscated from the 
 OS, and add the variable of over provisioning on top of that. If anyone 
 would like to really get into what's going on inside of an SSD that makes it 
 a bad choice for a ZIL, you can start here:

 http://en.wikipedia.org/wiki/TRIM_%28SSD_command%29

 and

 http://en.wikipedia.org/wiki/Write_amplification

 Which will be more than you might have ever wanted to know. :)

 So has anyone on this list actually run into this issue?  Tons of
 people use SSD-backed slog devices...

 The theory sounds sound, but if it's not really happening much in
 practice then I'm not too worried.  Especially when I can replace a
 drive from my slog mirror for a $400 or so if problems do arise... (the
 alternative being much more expensive DRAM backed devices)

Presumably this problem is being worked...

http://hg.genunix.org/onnv-gate.hg/rev/d560524b6bb6

Notice that it implements:

866610  Add SATA TRIM support

With this in place, I would imagine a next step is for zfs to issue
TRIM commands as zil entries have been committed to the data disks.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Halcyon ZFS and system monitoring software for OpenSolaris (beta)

2010-08-25 Thread Mike Kirk
Update: version 3.2.5 out now, with changes to better support snv_134:

http://forums.halcyoninc.com/showthread.php?t=368

If you've downloaded v3.2.4 and are on 09/06, there is no reason to upgrade.

Regards,

mike.k...@halcyoninc.com


Re: [zfs-discuss] Halcyon ZFS and system monitoring software for OpenSolaris (beta)

2010-08-20 Thread Mike Kirk
Hi zfs user,

 Is the beta free? for how long? if not how much for 5 machines?

Everything on our web site (including the beta) runs for 30 days with the 
baked-in license. After 30 days it will stop collecting fresh numbers, unless 
you add a license key, or a demo extension file from the sales team (or 
reinstall it and start over again).

 If you are going to post about your commercial products - please include 
 some price points, so people know whether to ignore the info based on 
 their budget. 

You're right, it would be nice if people could just go to our version of 
shop.oracle.com, but we're not there yet, and I don't have the price sheets 
the sales guys do to put those numbers in the forum. 

If you're still interested, please email me and I can put you in touch with 
someone who can directly deal with your pricing questions, without going 
through our web page or the sales alias etc.

Thanks!

mike.k...@halcyoninc.com


Re: [zfs-discuss] EMC migration and zfs

2010-08-16 Thread Mike DeMarco
Bump this up. Anyone?


Re: [zfs-discuss] New Supermicro SAS/SATA controller: AOC-USAS2-L8e in SOHO NAS and HD HTPC

2010-08-16 Thread Mike DeMarco
What I would really like to know is why PCIe RAID controller cards cost
more than an entire motherboard with a processor. Some cards cost over $1,000;
for what?


Re: [zfs-discuss] Moving /export to another zpool

2010-08-13 Thread Mike Gerdts
On Fri, Aug 13, 2010 at 1:07 PM, Handojo hando...@yahoo.com wrote:
 Are the old /opt and /expore still listed in your
 vfstab(4) file?

 I can't access /etc/vfstab because I can't even log in as my username. I can't
 even log in as root from the Login Screen

 And when I boot on using LiveCD, how can I mount my first drive that has 
 opensolaris installed ?

To list the zpools it can see:

zpool import

To import one called rpool at an alternate root:

zpool import -R /mnt rpool
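
From there the root file system still has to be mounted explicitly before you
can look at vfstab; a rough sequence from the live CD (the boot environment
name rpool/ROOT/opensolaris is a guess; check 'zfs list -r rpool' for the real
one):

zpool import -R /mnt rpool
zfs list -r rpool                  # find the root dataset, e.g. rpool/ROOT/opensolaris
zfs mount rpool/ROOT/opensolaris   # canmount=noauto, so it is not mounted automatically
cat /mnt/etc/vfstab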


-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] ZFS development moving behind closed doors

2010-08-13 Thread Mike M


On 8/13/2010 at 8:56 PM Eric D. Mudama wrote:

|On Fri, Aug 13 at 19:06, Frank Cusack wrote:
|Interesting POV, and I agree.  Most of the many distributions of
|OpenSolaris had very little value-add.  Nexenta was the most
interesting
|and why should Oracle enable them to build a business at their
expense?
|
|These distributions are, in theory, the gateway drug where people
|can experiment inexpensively to try out new technologies (ZFS, dtrace,
|crossbow, comstar, etc.) and eventually step up to Oracle's big iron
|as their business grows.
 =

Think: strategic business advantage.  

Oracle are not stupid, they recognize a jewel when they see one.



[zfs-discuss] zfs allow does not work for rpool

2010-07-28 Thread Mike DeMarco
I am trying to give a general user permissions to create zfs filesystems in the 
rpool.

zpool set delegation=on rpool
zfs allow user create rpool

both run without any issues.

zfs allow rpool reports the user does have create permissions.

zfs create rpool/test
cannot create rpool/test : permission denied.

Can you not use 'zfs allow' on the rpool?


Re: [zfs-discuss] zfs allow does not work for rpool

2010-07-28 Thread Mike DeMarco
Thanks. Adding mount did allow me to create the dataset, but it still does not
allow me to create the mountpoint.
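
A sketch of where this usually ends up (user name and ACL invented for
illustration): the user also has to be able to create the mountpoint directory,
which means either write access to the parent's mountpoint (/rpool here) or an
explicit mountpoint in a directory the user already owns:

# zpool set delegation=on rpool
# zfs allow someuser create,mount rpool
# zfs allow rpool                                        # verify the permissions took
# chmod A+user:someuser:add_subdirectory:allow /rpool    # let someuser create /rpool/test itself
$ zfs create rpool/test

The create succeeds once the user can both create datasets (the delegation) and
create the mount directory under the parent's mountpoint (the ACL); delegating
the mountpoint property as well would let the user point new datasets at a
directory they already own.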


Re: [zfs-discuss] NFS performance?

2010-07-26 Thread Mike Gerdts
On Mon, Jul 26, 2010 at 1:27 AM, Garrett D'Amore garr...@nexenta.com wrote:
 On Sun, 2010-07-25 at 21:39 -0500, Mike Gerdts wrote:
 On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore garr...@nexenta.com wrote:
  On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote:
 
  I think there may be very good reason to use iSCSI, if you're limited
  to gigabit but need to be able to handle higher throughput for a
  single client. I may be wrong, but I believe iSCSI to/from a single
  initiator can take advantage of multiple links in an active-active
  multipath scenario whereas NFS is only going to be able to take
  advantage of 1 link (at least until pNFS).
 
  There are other ways to get multiple paths.  First off, there is IP
  multipathing. which offers some of this at the IP layer.  There is also
  802.3ad link aggregation (trunking).  So you can still get high
  performance beyond  single link with NFS.  (It works with iSCSI too,
  btw.)

 With both IPMP and link aggregation, each TCP session will go over the
 same wire.  There is no guarantee that load will be evenly balanced
 between links when there are multiple TCP sessions.  As such, any
 scalability you get using these configurations will be dependent on
  having a complex enough workload, wise configuration choices, and
 a bit of luck.

 If you're really that concerned, you could use UDP instead of TCP.  But
 that may have other detrimental performance impacts, I'm not sure how
 bad they would be in a data center with generally lossless ethernet
 links.

Heh.  My horror story with reassembly was actually with connectionless
transports (LLT, then UDP).  Oracle RAC's cache fusion sends 8 KB
blocks via UDP by default, or LLT when used in the Veritas + Oracle
RAC certified configuration from 5+ years ago.  The use of Sun
trunking with round robin hashing and the lack of use of jumbo packets
made every cache fusion block turn into 6 LLT or UDP packets that had
to be reassembled on the other end.  This was on a 15K domain with the
NICs spread across IO boards.  I assume that interrupts for a NIC are
handled by a CPU on the closest system board (Solaris 8, FWIW).  If
that assumption is true then there would also be a flurry of
inter-system board chatter to put the block back together.  In any
case, performance was horrible until we got rid of round robin and
enabled jumbo frames.

 Btw, I am not certain that the multiple initiator support (mpxio) is
 necessarily any better as far as guaranteed performance/balancing.  (It
 may be; I've not looked closely enough at it.)

I haven't paid close attention to how mpxio works.  The Veritas
analog, vxdmp, does a very good job of balancing traffic down multiple
paths, even when only a single LUN is accessed.  The exact mode that
dmp will use is dependent on the capabilities of the array it is
talking to - many arrays work in an active/passive mode.  As such, I
would expect that with vxdmp or mpxio the balancing with iSCSI would
be at least partially dependent on what the array said to do.

 I should look more closely at NFS as well -- if multiple applications on
  the same client are accessing the same filesystem, do they use a single
 common TCP session, or can they each have separate instances open?
 Again, I'm not sure.

It's worse than that.  A quick experiment with two different
automounted home directories from the same NFS server suggests that
both home directories share one TCP session to the NFS server.

The latest version of Oracle's RDBMS supports a userland NFS client
option.  It would be very interesting to see if this does a separate
session per data file, possibly allowing for better load spreading.

 Note that with Sun Trunking there was an option to load balance using
 a round robin hashing algorithm.  When pushing high network loads this
 may cause performance problems with reassembly.

 Yes.  Reassembly is Evil for TCP performance.

 Btw, the iSCSI balancing act that was described does seem a bit
 contrived -- a single initiator and a COMSTAR server, both client *and
 server* with multiple ethernet links instead of a single 10GbE link.

 I'm not saying it doesn't happen, but I think it happens infrequently
 enough that its reasonable that this scenario wasn't one that popped
 immediately into my head. :-)

It depends on whether the people that control the network gear are the
same ones that control servers.  My experience suggests that if there
is a disconnect, it seems rather likely that each group's
standardization efforts, procurement cycles, and capacity plans will
work against any attempt to have an optimal configuration.

Also, it is rather common to have multiple 1 Gb links to servers going
to disparate switches so as to provide resilience in the face of
switch failures.  This is not unlike (at a block diagram level) the
architecture that you see in pretty much every SAN.  In such a
configuation, it is reasonable for people to expect that load
balancing will occur.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/

Re: [zfs-discuss] NFS performance?

2010-07-26 Thread Mike Gerdts
On Mon, Jul 26, 2010 at 2:56 PM, Miles Nordin car...@ivy.net wrote:
 mg == Mike Gerdts mger...@gmail.com writes:
    mg it is rather common to have multiple 1 Gb links to
    mg servers going to disparate switches so as to provide
    mg resilience in the face of switch failures.  This is not unlike
    mg (at a block diagram level) the architecture that you see in
    mg pretty much every SAN.  In such a configuation, it is
    mg reasonable for people to expect that load balancing will
    mg occur.

 nope.  spanning tree removes all loops, which means between any two
 points there will be only one enabled path.  An L2-switched network
 will look into L4 headers for splitting traffic across an aggregated
 link (as long as it's been deliberately configured to do that---by
 default probably only looks to L2), but it won't do any multipath
 within the mesh.

I was speaking more of IPMP, which is at layer 3.

 Even with an L3 routing protocol it usually won't do multipath unless
 the costs of the paths match exactly, so you'd want to build the
 topology to achieve this and then do all switching at layer 3 by
 making sure no VLAN is larger than a switch.

By default, IPMP does outbound load spreading.  Inbound load spreading
is not practical with a single (non-test) IP address.  If you have
multiple virtual IP's you can spread them across all of the NICs in
the IPMP group and get some degree of inbound spreading as well.  This
is the default behavior of the OpenSolaris IPMP implementation, last I
looked.  I've not seen any examples (although I can't say I've looked
real hard either) of the Solaris 10 IPMP configuration set up with
multiple IP's to encourage inbound load spreading as well.


 There's actually a cisco feature to make no VLAN larger than a *port*,
 which I use a little bit.  It's meant for CATV networks I think, or
 DSL networks aggregated by IP instead of ATM like maybe some European
 ones?  but the idea is not to put edge ports into vlans any more but
 instead say 'ip unnumbered loopbackN', and then some black magic they
 have built into their DHCP forwarder adds /32 routes by watching the
 DHCP replies.  If you don't use DHCP you can add static /32 routes
 yourself, and it will work.  It does not help with IPv6, and also you
 can only use it on vlan-tagged edge ports (what? arbitrary!) but
 neat that it's there at all.

  http://www.cisco.com/en/US/docs/ios/12_3t/12_3t4/feature/guide/gtunvlan.html

Interesting... however this seems to limit you to  4096 edge ports
per VTP domain, as the VID field in the 802.1q header is only 12 bits.
 It is also unclear how this works when you have one physical host
with many guests.  And then there is the whole thing that I don't
really see how this helps with resilience in the face of a switch
failure.  Cool technology, but I'm not certain that it addresses what
I was talking about.


 The best thing IMHO would be to use this feature on the edge ports,
 just as I said, but you will have to teach the servers to VLAN-tag
 their packets.  not such a bad idea, but weird.

 You could also use it one hop up from the edge switches, but I think
 it might have problems in general removing the routes when you unplug
 a server, and using it one hop up could make them worse.  I only use
 it with static routes so far, so no mobility for me: I have to keep
 each server plugged into its assigned port, and reconfigure switches
 if I move it.  Once you have ``no vlan larger than 1 switch,'' if you
 actually need a vlan-like thing that spans multiple switches, the new
 word for it is 'vrf'.

There was some other Cisco dark magic that our network guys were
touting a while ago that would make each edge switch look like a blade
in a 6500 series.  This would then allow them to do link aggregation
across edge switches.  At least two out of organizational changes,
personnel changes, and roadmap changes happened, so I've not seen
this in action.


 so, yeah, it means the server people will have to take over the job of
 the networking people.  The good news is that networking people don't
 like spanning tree very much because it's always going wrong, so
 AFAICT most of them who are paying attention are already moving in
 this direction.






-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] NFS performance?

2010-07-25 Thread Mike Gerdts
On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore garr...@nexenta.com wrote:
 On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote:

 I think there may be very good reason to use iSCSI, if you're limited
 to gigabit but need to be able to handle higher throughput for a
 single client. I may be wrong, but I believe iSCSI to/from a single
 initiator can take advantage of multiple links in an active-active
 multipath scenario whereas NFS is only going to be able to take
 advantage of 1 link (at least until pNFS).

 There are other ways to get multiple paths.  First off, there is IP
 multipathing. which offers some of this at the IP layer.  There is also
 802.3ad link aggregation (trunking).  So you can still get high
 performance beyond  single link with NFS.  (It works with iSCSI too,
 btw.)

With both IPMP and link aggregation, each TCP session will go over the
same wire.  There is no guarantee that load will be evenly balanced
between links when there are multiple TCP sessions.  As such, any
scalability you get using these configurations will be dependent on
having a complex enough workload, wise configuration choices, and
a bit of luck.

Note that with Sun Trunking there was an option to load balance using
a round robin hashing algorithm.  When pushing high network loads this
may cause performance problems with reassembly.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Hashing files rapidly on ZFS

2010-07-07 Thread Mike Gerdts
On Tue, Jul 6, 2010 at 10:29 AM, Arne Jansen sensi...@gmx.net wrote:
 Daniel Carosone wrote:
 Something similar would be useful, and much more readily achievable,
 from ZFS from such an application, and many others.  Rather than a way
 to compare reliably between two files for identity, I'ld liek a way to
 compare identity of a single file between two points in time.  If my
 application can tell quickly that the file content is unaltered since
 last time I saw the file, I can avoid rehashing the content and use a
 stored value. If I can achieve this result for a whole directory
 tree, even better.

 This would be great for any kind of archiving software. Aren't zfs checksums
 already ready to solve this? If a file changes, its dnode's checksum changes,
 the checksum of the directory it is in and so forth all the way up to the
 uberblock.
 There may be ways a checksum changes without a real change in the files 
 content,
 but the other way round should hold. If the checksum didn't change, the file
 didn't change.
 So the only missing link is a way to determine zfs's checksum for a
 file/directory/dataset. Am I missing something here? Of course atime update
 should be turned off, otherwise the checksum will get changed by the archiving
 agent.

What is the likelihood that the same data is re-written to the file?
If that is unlikely, it looks as though znode_t's z_seq may be useful.
 While it isn't a checksum, it seems to be incremented on every file
change.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Expected throughput

2010-07-04 Thread Mike Gerdts
On Sun, Jul 4, 2010 at 11:28 AM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:

 Ok... so we've rebuilt the pool as 14 pairs of mirrors, each pair having
 one disk in each of the two JBODs.  Now we're getting about 500-1000 IOPS
 (according to zpool iostat) and 20-30MB/sec in random read on a big
 database.  Does that sounds right?

 I am not sure who wrote the above text since the attribution quoting is all
 botched up (Gmail?) in this thread.  Regardless, it is worth pointing out
 that 'zpool iostat' only reports the I/O operations which were actually
 performed.  It will not report the operations which did not need to be
 performed due to already being in cache.  A quite busy system can still
 report very little via 'zpool iostat' if it has enough RAM to cache the
 requested data.

 Bob

Very good point.  You can use a combination of zpool iostat and
fsstat to see the effect of reads that didn't turn into physical I/Os.
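
For example, run something like this in two terminals during the test (pool
name is assumed):

fsstat zfs 5          # logical read/write operations and bytes at the VFS layer
zpool iostat tank 5   # operations and bandwidth that actually hit the pool's disks

The gap between the logical numbers from fsstat and the physical numbers from
zpool iostat is roughly what the ARC absorbed.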

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Expected throughput

2010-07-04 Thread Mike Gerdts
On Sun, Jul 4, 2010 at 10:08 AM, Ian D rewar...@hotmail.com wrote:
 What I don't understand is why, when I run a single query I get 100 IOPS
 and 3MB/sec.  The setup can obviously do better, so where is the
 bottleneck?  I don't see any CPU core on any side being maxed out so it
 can't be it...

In what way is CPU contention being monitored?  prstat without
options is nearly useless for a multithreaded app on a multi-CPU (or
multi-core/multi-thread) system.  mpstat is only useful if threads
never migrate between CPU's.  prstat -mL gives a nice picture of how
busy each LWP (thread) is.

When viewed with prstat -mL, A thread that has usr+sys at 100%
cannot go any faster, unless you can get the CPU to go faster, as I
suggest below. From my understanding (perhaps not 100% correct on the
rest of this paragraph):  The time spent in TRP may be reclaimed by
running the application in a processor set with interrupts disabled on
all of its processors.  If TFL or DFL are high, optimizing the use of
cache may be beneficial.  Examples of how you can optimize the use of
cache include using the FX scheduler with a priority that gives
relatively long time slices, using processor sets to keep other
processes off of the same caches (which are often shared by multiple
cores), or perhaps disabling CPU's (threads) to ensure that only a
single core is using each cache.  With current generation Intel CPU's,
this can allow the CPU clock rate to increase, thereby allowing more
work to get done.
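
As a rough sketch of those two ideas (the CPU ids, processor set id, and pid
are made up for illustration):

# psrset -c 4 5 6 7                            # create a processor set from CPUs 4-7
# psrset -f 1                                  # disable interrupts on the new set (set id 1 assumed)
# psrset -b 1 12345                            # bind the process of interest (pid 12345) to the set
# priocntl -s -c FX -m 59 -p 59 -i pid 12345   # move it to the fixed-priority class

psrset -n re-enables interrupts and psrset -d tears the set down when you are
done experimenting.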

 The database is MySQL, it runs on a Linux box that connects to the Nexenta

Oh, since the database runs on Linux I guess you need to dig up top's
equivalent of prstat -mL.  Unfortunately, I don't think that Linux
has microstate accounting and as such you may not have visibility into
time spent on traps, text faults, and data faults on a per-process
basis.

 server through 10GbE using iSCSI.

Have you done any TCP tuning?  Based on the numbers you cite above, it
looks like you are doing about 32 KB I/O's.  I think you can perform a
test that involves mainly the network if you use netperf with options
like:

netperf -H $host -t TCP_RR -l 30 -- -r 32768

That is speculation based on reading
http://www.netperf.org/netperf/training/Netperf.html.  Someone else
(perhaps on networking or performance lists) may have better tests to
run.

--
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Expected throughput

2010-07-04 Thread Mike Gerdts
On Sun, Jul 4, 2010 at 2:08 PM, Ian D rewar...@hotmail.com wrote:

 Mem:  74098512k total, 73910728k used,   187784k free,    96948k buffers
 Swap:  2104488k total,      208k used,  2104280k free, 63210472k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 17652 mysql     20   0 3553m 3.1g 5472 S   38  4.4 247:51.80 mysqld
 16301 mysql     20   0 4275m 3.3g 5980 S    4  4.7   5468:33 mysqld
 16006 mysql     20   0 4434m 3.3g 5888 S    3  4.6   5034:06 mysqld
 12822 root      15  -5     0    0    0 S    2  0.0  22:00.50 scsi_wq_39

Is that 38% of one CPU or 38% of all CPU's?  How many CPU's does the
Linux box have?  I don't mean the number of sockets, I mean number of
sockets * number of cores * number of threads per core.  My
recollection of top is that the CPU percentage is:

(pcpu_t2 - pcpu_t1) / (interval * ncpus)

Where pcpu_t* is the process CPU time at a particular time.  If you
have a two socket quad core box with hyperthreading enabled, that is 2
* 4 * 2 = 16 CPU's.  38% of 16 CPU's can be roughly 6 CPU's running as
fast as they can (and 10 of them idle) or 16 CPU's each running at
about 38%.  In the I don't have a CPU bottleneck argument, there is
a big difference.

If PID 16301 has a single thread that is doing significant work, on
the hypothetical 16 CPU box this means that it is spending about 2/3
of the time on CPU.  If the workload does:

while ( 1 ) {
issue I/O request
get response
do cpu-intensive work work
}

It is only trying to do I/O 1/3 of the time.  Further, it has put a
single high latency operation between its bursts of CPU activity.

One other area of investigation that I didn't mention before: Your
stats imply that the Linux box is getting data 32 KB at a time.  How
does 32 KB compare to the database block size?  How does 32 KB compare
to the block size on the relevant zfs filesystem or zvol?  Are blocks
aligned at the various layers?

http://blogs.sun.com/dlutz/entry/partition_alignment_guidelines_for_unified
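
If the sizes don't line up, the fix is at creation time; something along these
lines, with invented names and 16k chosen because it matches InnoDB's default
page size (volblocksize cannot be changed after the zvol exists):

zfs create -V 200G -o volblocksize=16k tank/mysql16k
zfs get volblocksize tank/mysql16k
stmfadm create-lu /dev/zvol/rdsk/tank/mysql16k    # if COMSTAR is serving the LU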

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Use of blocksize (-b) during zfs zvol create, poor performance

2010-06-30 Thread Mike La Spina
Hi Eff,

There are a significant number of variables to work through with dedup and
compression enabled, so the first suggestion I have is to disable those
features for now so you're not working with too many elements at once.

With those features set aside, an NTFS cluster operation does not equal a 64k
raw I/O block, and likewise a ZFS 64k blocksize does not equal one I/O
operation. We may also need to consider the overall network performance
behavior, the iSCSI protocol characteristics, and the Windows network stack.

iperf is a good tool to rule that out.

What I primarily suspect is that write I/O operations are not aligned and are
waiting for I/O completion across multiple vdevs. Alignment is important for
write I/O optimization, and how the I/O maps onto the software RAID layout has
a significant impact on the DMU and SPA operations for a specific vdev layout.
You may also have an issue with write cache operations: by default, large I/O
calls such as 64K will not use a ZIL cache (slog) vdev, if you have one
defined, but will be written directly to your array vdevs, which also incurs a
transaction group write operation.

To ensure ZIL log usage with 64k I/O's you can apply the following: 
edit the /etc/system file with  

set zfs:zfs_immediate_write_sz = 131071

a reboot is required to activate the system file

You have also not indicated what your zpool configuration looks like; that
would be helpful in the discussion.

It appears that you're applying the x4500 as a backup target, which means you
should (if not already) enable write caching on the COMSTAR LU properties for
this type of application, e.g.:

stmfadm modify-lu -p wcd=false 600144F02F2280004C1D62010001

To help triage the perf issue further you could post 2 'kstat zfs' + 2 'kstat 
stmf' outputs on a 5 min interval and a 'zpool iostat -v 30 5' which would help 
visualize the I/O behavior. 

Regards,

Mike

http://blog.laspina.ca/


Re: [zfs-discuss] COMSTAR ISCSI - configuration export/import

2010-06-28 Thread Mike Devlin
I havnt tried it yet, but supposedly this will backup/restore the
comstar config:

$ svccfg export -a stmf > comstar.bak.${DATE}

If you ever need to restore the configuration, you can attach the
storage and run an import:

$ svccfg import comstar.bak.${DATE}


- Mike

On 6/28/10, bso...@epinfante.com bso...@epinfante.com wrote:
 Hi all,

 Having osol b134 exporting a couple of iscsi targets to some hosts,how can
 the COMSTAR configuration be migrated to other host?
 I can use the ZFS send/receive to replicate the luns but how can I
 replicate the target,views from serverA to serverB ?

 Is there any best procedures to follow to accomplish this?
 Thanks for all your time,

 Bruno

 Sent from my HTC
 --
 This message has been scanned for viruses and
 dangerous content by MailScanner, and is
 believed to be clean.



-- 
Sent from my mobile device


Re: [zfs-discuss] VXFS to ZFS Quota

2010-06-18 Thread Mike Gerdts
On Fri, Jun 18, 2010 at 8:09 AM, David Magda dma...@ee.ryerson.ca wrote:
 You could always split things up into groups of (say) 50. A few jobs ago,
 I was in an environment where we have a /home/students1/ and
 /home/students2/, along with a separate faculty/ (using Solaris and UFS).
 This had more to do with IOps than anything else.

A decade or so ago when I managed similar environments and had (I
think) 6 file systems handling about 5000 students.  Each file system
had about 1/6 of the students.  Challenges I found in this were:

- Students needed to work on projects together.  The typical way to do
this was for them to request a group, then create a group writable
directory in one of their home directories.  If all students in the
group had home directories on the same file system, there was nothing
special to consider.  If they were on different file systems then at least one
would need a non-zero quota (that is, not 0 blocks soft, 1 block hard) on the
file system where the group directory resides.
- Despite your best efforts things will get imbalanced.  If you are
tight on space, this means that you will need to migrate users.  This
will become apparent only at the times of the semester where even
per-user outages are most inconvenient (i.e. at 6 and 13 weeks when
big projects tend to be due).

It's probably a good idea to consider these types of situations in the
transition plan, or at least determine that they don't apply.  I was
working in a college of engineering where group projects were common
and CAD, EDA, and simulation tools could generate big files very
quickly.
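
If the new home directories land on ZFS, both layouts are cheap to express; a
sketch with invented names:

zfs create -o quota=2g pool/home/student1       # one file system per student
zfs set userquota@student2=2g pool/home         # or per-user quotas inside one file system
zfs create -o quota=10g pool/projects/team42    # shared group space, outside any home quota

Per-user quotas need a reasonably recent pool (they arrived with zpool version
15), so check 'zpool upgrade -v' before counting on them.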

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Dedup... still in beta status

2010-06-15 Thread Mike Gerdts
On Tue, Jun 15, 2010 at 7:28 PM, David Magda dma...@ee.ryerson.ca wrote:
 On Jun 15, 2010, at 14:20, Fco Javier Garcia wrote:

 I think dedup may have its greatest appeal in VDI environments (think
 about a environment with 85% if the data that the virtual machine needs is
 into ARC or L2ARC... is like a dream...almost instantaneous response... and
 you can boot a new machine in a few seconds)...

 This may also be accomplished by using snapshots and clones of data sets. At
 least for OS images: user profiles and documents could be something else
 entirely.

It all depends on the nature of the VDI environment.  If the VMs are
regenerated on each login, the snapshot + clone mechanism is
sufficient.  Deduplication is not needed.  However, if VMs have a long
life and get periodic patches and other software updates,
deduplication will be required if you want to remain at somewhat
constant storage utilization.

It probably makes a lot of sense to be sure that swap or page files
are on a non-dedup dataset.  Executables and shared libraries
shouldn't be getting paged out to it and the likelihood that multiple
VMs page the same thing to swap or a page file is very small.
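
A sketch of that, with an invented size, and the block size matched to the
page size as is usually recommended for swap zvols:

zfs create -V 8G -b $(pagesize) -o dedup=off -o compression=off rpool/swap1
swap -a /dev/zvol/dsk/rpool/swap1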

 Another situation that comes to mind is perhaps as the back-end to a mail
 store: if you send out a message(s) with an attachment(s) to a lot of
 people, the attachment blocks could be deduped (and perhaps compressed as
 well, since base-64 adds 1/3 overhead).

It all depends on how this is stored.  If the attachments are stored
like they were in 1990 as part of an mbox format, you will be very
unlikely to get the proper block alignment.  Even storing the message
body (including headers) in the same file as the attachment may not
align the attachments because the mail headers may be different (e.g.
different recipients' messages took different paths, some were
forwarded, etc.).  If the attachments are stored in separate files or
a database format is used that stores attachments separate from the
message (with matching database + zfs block size) things may work out
favorably.

However, a system that detaches messages and stores them separately
may just as well store them in a file that matches the SHA256 hash,
assuming that file doesn't already exist.  If does exist, it can just
increment a reference count.  In other words, an intelligent mail
system should already dedup.  Or at least that is how I would have
written it for the last decade or so...
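
To make that concrete, a toy sketch in shell (paths and message names are made
up; digest(1) is the Solaris command, sha256sum elsewhere), where the hard link
count serves as the reference count:

store=/var/mailstore/attachments
att=/tmp/incoming-attachment
h=$(digest -a sha256 "$att")                     # hash of the attachment contents
[ -f "$store/$h" ] || cp "$att" "$store/$h"      # store the blob only once
ln "$store/$h" /var/mailstore/msg-4711/part-2    # refcount++ via the link count

When the last message referencing the blob is deleted, the link count drops to
zero and the space comes back on its own.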

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Mike Gerdts
On Thu, Jun 10, 2010 at 9:39 AM, Andrey Kuzmin
andrey.v.kuz...@gmail.com wrote:
 On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 21/10/2009 03:54, Bob Friesenhahn wrote:

 I would be interested to know how many IOPS an OS like Solaris is able to
 push through a single device interface.  The normal driver stack is likely
 limited as to how many IOPS it can sustain for a given LUN since the driver
 stack is optimized for high latency devices like disk drives.  If you are
 creating a driver stack, the design decisions you make when requests will be
 satisfied in about 12ms would be much different than if requests are
 satisfied in 50us.  Limitations of existing software stacks are likely
 reasons why Sun is designing hardware with more device interfaces and more
 independent devices.


 Open Solaris 2009.06, 1KB READ I/O:

 # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0

 /dev/null is usually a poor choice for a test like this. Just to be on the
 safe side, I'd rerun it with /dev/random.
 Regards,
 Andrey

(aside from other replies about read vs. write and /dev/random...)

Testing performance of disk by reading from /dev/random and writing to
disk is misguided.  From random(7d):

   Applications retrieve random bytes by reading /dev/random
   or /dev/urandom. The /dev/random interface returns random
   bytes only when sufficient amount of entropy has been collected.

In other words, when the kernel doesn't think that it can give high
quality random numbers, it stops providing them until it has gathered
enough entropy.  It will pause your reads.

If instead you use /dev/urandom, the above problem doesn't exist, but
the generation of random numbers is CPU-intensive.  There is a
reasonable chance (particularly with slow CPU's and fast disk) that
you will be testing the speed of /dev/urandom rather than the speed of
the disk or other I/O components.

If your goal is to provide data that is not all 0's to prevent ZFS
compression from making the file sparse, or you want to be sure that
compression doesn't otherwise make the actual writes smaller, you
could try something like:

# create a file just over 100 MB
dd if=/dev/random of=/tmp/randomdata bs=513 count=204401
# repeatedly feed that file to dd
while true ; do cat /tmp/randomdata ; done | dd of=/my/test/file bs=... count=...

The above should make it so that it will take a while before there are
two blocks that are identical, thus confounding deduplication as well.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds

2010-05-31 Thread Mike Gerdts
On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness san...@van-ness.com wrote:
 On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
 There are multiple factors at work.  Your OpenSolaris should be new
 enough to have the fix in which the zfs I/O tasks are run in in a
 scheduling class at lower priority than normal user processes.
 However, there is also a throttling mechanism for processes which
 produce data faster than can be consumed by the disks.  This
 throttling mechanism depends on the amount of RAM available to zfs and
 the write speed of the I/O channel.  More available RAM results in
 more write buffering, which results in a larger chunk of data written
 at the next transaction group write interval.  The maximum size of a
 transaction group may be configured in /etc/system similar to:

 * Set ZFS maximum TXG group size to 2684354560
  set zfs:zfs_write_limit_override = 0xa0000000

 If the transaction group is smaller, then zfs will need to write more
 often.  Processes will still be throttled but the duration of the
 delay should be smaller due to less data to write in each burst.  I
 think that (with multiple writers) the zfs pool will be healthier
 and less fragmented if you can offer zfs more RAM and accept some
 stalls during writing.  There are always tradeoffs.

 Bob
 well it seems like when messing with the txg sync times and stuff like
 that it did make the transfer more smooth but didn't actually help with
 speeds as it just meant the hangs happened for a shorter time but at a
 smaller interval and actually lowering the time between writes just
 seemed to make things worse (slightly).

 I think I have came to the conclusion that the problem here is CPU due
 to the fact that its only doing this with parity raid. I would think if
 it was I/O based then it would be the same as if anything its heavier on
 I/O on non parity raid due to the fact that it is no longer CPU
 bottlenecked (dd write test gives me near 700 megabytes/sec vs 450 with
 parity raidz2).

To see if the CPU is pegged, take a look at the output of:

mpstat 1
prstat -mLc 1

If mpstat shows that the idle time reaches 0 or the process' latency
column is more than a few tenths of a percent, you are probably short
on CPU.

It could also be that interrupts are stealing cycles from rsync.
Placing it in a processor set with interrupts disabled in that
processor set may help.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds

2010-05-31 Thread Mike Gerdts
Sorry, turned on html mode to avoid gmail's line wrapping.

On Mon, May 31, 2010 at 4:58 PM, Sandon Van Ness san...@van-ness.comwrote:

 On 05/31/2010 02:52 PM, Mike Gerdts wrote:
  On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness san...@van-ness.com
 wrote:
 
  On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
 
  There are multiple factors at work.  Your OpenSolaris should be new
  enough to have the fix in which the zfs I/O tasks are run in in a
  scheduling class at lower priority than normal user processes.
  However, there is also a throttling mechanism for processes which
  produce data faster than can be consumed by the disks.  This
  throttling mechanism depends on the amount of RAM available to zfs and
  the write speed of the I/O channel.  More available RAM results in
  more write buffering, which results in a larger chunk of data written
  at the next transaction group write interval.  The maximum size of a
  transaction group may be configured in /etc/system similar to:
 
  * Set ZFS maximum TXG group size to 2684354560
   set zfs:zfs_write_limit_override = 0xa0000000
 
  If the transaction group is smaller, then zfs will need to write more
  often.  Processes will still be throttled but the duration of the
  delay should be smaller due to less data to write in each burst.  I
  think that (with multiple writers) the zfs pool will be healthier
  and less fragmented if you can offer zfs more RAM and accept some
  stalls during writing.  There are always tradeoffs.
 
  Bob
 
  well it seems like when messing with the txg sync times and stuff like
  that it did make the transfer more smooth but didn't actually help with
  speeds as it just meant the hangs happened for a shorter time but at a
  smaller interval and actually lowering the time between writes just
  seemed to make things worse (slightly).
 
  I think I have came to the conclusion that the problem here is CPU due
  to the fact that its only doing this with parity raid. I would think if
  it was I/O based then it would be the same as if anything its heavier on
  I/O on non parity raid due to the fact that it is no longer CPU
  bottlenecked (dd write test gives me near 700 megabytes/sec vs 450 with
  parity raidz2).
 
  To see if the CPU is pegged, take a look at the output of:
 
  mpstat 1
  prstat -mLc 1
 
  If mpstat shows that the idle time reaches 0 or the process' latency
   column is more than a few tenths of a percent, you are probably short
  on CPU.
 
  It could also be that interrupts are stealing cycles from rsync.
  Placing it in a processor set with interrupts disabled in that
  processor set may help.
 
 

 Unfortunately none of these utilities make it possible to get values for 1
 second, which is what the hang is (it's happening for about 1/2 of a second).


 Here is with mpstat:


snip - bad line wrapped lines removed



 Here is what i get with prstat:


snip - removed first interval  fixed formatting of next


 Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
   604 root     0.0  33 0.0 0.0 0.0 0.0  42  25  18  13   0   0 zpool-data/13
   604 root     0.0  30 0.0 0.0 0.0 0.0  41  29  12  12   0   0 zpool-data/15
  1326 root      12 2.9 0.0 0.0 0.0 0.0  85 0.4  1K  12 11K   0 rsync/1
   604 root     0.0  15 0.0 0.0 0.0 0.0  41  44 111   9   0   0 zpool-data/27
   604 root     0.0  14 0.0 0.0 0.0 0.0  43  42  72   3   0   0 zpool-data/33
   604 root     0.0 5.9 0.0 0.0 0.0 0.0  41  53 109   6   0   0 zpool-data/19
   604 root     0.0 5.4 0.0 0.0 0.0 0.0  42  53 106   8   0   0 zpool-data/25
   604 root     0.0 5.3 0.0 0.0 0.0 0.0  43  51 107   7   0   0 zpool-data/21
   604 root     0.0 4.5 0.0 0.0 0.0 0.0  41  54 110   4   0   0 zpool-data/31
   604 root     0.0 3.9 0.0 0.0 0.0 0.0  41  55 109   3   0   0 zpool-data/23
   604 root     0.0 3.7 0.0 0.0 0.0 0.0  44  52 111   2   0   0 zpool-data/29
  1322 root 0.0 0.4 0.0 0.0 0.0 0.0  98 2.0  1K   0   1   0 rsync/1
  22644 root 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  16  13 255   0 prstat/1
  14409 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   5   3  69   0 sshd/1
   196 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   2 105   0 nscd/17


In the interval above, it looks as though rsync is spending a very small
amount of time waiting for a CPU (LAT, or latency), but various zfs threads
spent quite a bit of time (relative to the interval) waiting for a CPU.
Compared to the next interval, rsync spent very little time on cpu
(usr + sys): 14.9% vs. 55%, 58%, etc.  Perhaps it is not being fed data
quickly enough because of CPU contention that prevents timely transferring
of data from the NIC to rsync.


Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  1326 root  44  11 0.0 0.0 0.0 0.0  44 1.3  5K  24 42K   0 rsync/1
  1322 root 0.0 1.4 0.0 0.0 0.0 0.0  93 5.9  5K   0   0   0

Re: [zfs-discuss] Is it safe to disable the swap partition?

2010-05-09 Thread Mike Gerdts
On Sun, May 9, 2010 at 7:40 PM, Edward Ned Harvey
solar...@nedharvey.com wrote:

  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Richard Elling
 
  For a storage server, swap is not needed. If you notice swap being used
  then your storage server is undersized.

 Indeed, I have two solaris 10 fileservers that have uptime in the range of a
 few months.  I just checked swap usage, and they're both zero.

 So, Bob, rub it in if you wish.  ;-)  I was wrong.  I knew the behavior in
 Linux, which Roy seconded as most OSes, and apparently we both assumed the
 same here, but that was wrong.  I don't know if solaris and opensolaris both
 have the same swap behavior.  I don't know if there's *ever* a situation
 where solaris/opensolaris would swap idle processes.  But there's at least
 evidence that my two servers have not, or do not.

If Solaris is under memory pressure, pages may be paged to swap.
Under severe memory pressure, entire processes may be swapped.  This
will happen after freeing up the memory used for file system buffers,
ARC, etc.  If the processes never page in the pages that have been
paged out (or the processes that have been swapped out are never
scheduled) then those pages will not consume RAM.

The best thing to do with processes that can be swapped out forever is
to not run them.
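
For the record, the quick way to check whether a Solaris box has ever pushed
anything out to its swap devices is something like:

swap -l       # if 'free' equals 'blocks' on every device, nothing is swapped out
swap -s       # summary of allocated/reserved/available virtual swap
vmstat -S 5   # the si/so columns show swap-ins and swap-outs per second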

--
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Mac OS X clients with ZFS server

2010-04-22 Thread Mike Mackovitch
On Thu, Apr 22, 2010 at 12:40:37PM -0700, Rich Teer wrote:
 On Thu, 22 Apr 2010, Tomas Ögren wrote:
 
  Copying via terminal (and cp) works.
 
 Interesting: if I copy a file *which has no extended attributes* using cp in
 a terminal, it works fine.  If I try to cp a file that has EA (to the same
 destination), it hangs.  But I get this error message after a few seconds:
 
 cp file_without_EA /net/zen/export/home/rich
 cp file_with_EA /net/zen/export/home/rich
 nfs server zen:/export/home: lockd not responding

So, it looks like you need to investigate why the client isn't
getting responses from the server's lockd.

This is usually caused by a firewall or NAT getting in the way.

HTH
--macko


Re: [zfs-discuss] Mac OS X clients with ZFS server

2010-04-22 Thread Mike Mackovitch
On Thu, Apr 22, 2010 at 01:54:26PM -0700, Rich Teer wrote:
 On Thu, 22 Apr 2010, Mike Mackovitch wrote:
 
 Hi Mike,
 
  So, it looks like you need to investigate why the client isn't
  getting responses from the server's lockd.
  
  This is usually caused by a firewall or NAT getting in the way.
 
 Great idea--I was indeed connected to my network using the AirPort interface,
 through a Wifi router.  So as an experiment, I tried using a hard-wired,
 manually set up Ethernet connection.  Same result: no dice.  :-(
 
 I checked the firewall settings on my laptop, and the firewall is turned off.
 
 Do you have any other ideas?  It'd be really nice to get this working!

I would also check /var/log/system.log and /var/log/kernel.log on the Mac to
see if any other useful messages are getting logged.

Then I'd grab packet traces with wireshark/tcpdump/snoop *simultaneously* on
the client and the server, reproduce the problem, and then determine which
packets are being sent and which packets are being received.
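
If it helps, the commands I have in mind look roughly like this (interface
names and the client name are placeholders; on the Solaris side lockd normally
talks on port 4045):

# on the OpenSolaris server
snoop -d e1000g0 -o /tmp/lockd-server.cap host mac-client

# on the Mac
sudo tcpdump -i en0 -w /tmp/lockd-client.pcap host zen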

HTH
--macko


[zfs-discuss] Why does ARC grow above hard limit?

2010-04-05 Thread Mike Z
I would appreciate it if somebody could clarify a few points.

I am doing some random WRITE testing (100% writes, 100% random) and observe
that the ARC grows way beyond the hard limit during the test. The hard limit is
set to 512 MB via /etc/system and I see the size going up to 1 GB - how is that
happening?

mdb's ::memstat reports 1.5 GB used - does this include ARC as well or is it 
separate?

On the back end I see only reads (205 MB/s) and almost no writes (1.1 MB/s) - any
ideas what is being read?
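
For reference, the same values reported below by arc_summary.pl can also be
watched live while the test runs, e.g.:

kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max 5
echo ::arc | mdb -k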

--- BEFORE TEST 
# ~/bin/arc_summary.pl

System Memory:
 Physical RAM:  12270 MB
 Free Memory :  7108 MB
 LotsFree:  191 MB

ZFS Tunables (/etc/system):
 set zfs:zfs_prefetch_disable = 1
 set zfs:zfs_arc_max = 0x2000
 set zfs:zfs_arc_min = 0x1000

ARC Size:
 Current Size: 136 MB (arcsize)
 Target Size (Adaptive):   512 MB (c)
 Min Size (Hard Limit):256 MB (zfs_arc_min)
 Max Size (Hard Limit):512 MB (zfs_arc_max)
...


> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     800895              3128   25%
ZFS File Data              394450              1540   13%
Anon                       106813               417    3%
Exec and libs                4178                16    0%
Page cache                  14333                55    0%
Free (cachelist)            22996                89    1%
Free (freelist)           1797511              7021   57%

Total                     3141176             12270
Physical                  3141175             12270


--- DURING THE TEST
# ~/bin/arc_summary.pl 
System Memory:
 Physical RAM:  12270 MB
 Free Memory :  6687 MB
 LotsFree:  191 MB

ZFS Tunables (/etc/system):
 set zfs:zfs_prefetch_disable = 1
 set zfs:zfs_arc_max = 0x2000
 set zfs:zfs_arc_min = 0x1000

ARC Size:
 Current Size: 1336 MB (arcsize)
 Target Size (Adaptive):   512 MB (c)
 Min Size (Hard Limit):256 MB (zfs_arc_min)
 Max Size (Hard Limit):512 MB (zfs_arc_max)

ARC Size Breakdown:
 Most Recently Used Cache Size:  87%446 MB (p)
 Most Frequently Used Cache Size:12%65 MB (c-p)

ARC Efficency:
 Cache Access Total:             51681761
 Cache Hit Ratio:      52%       27056475       [Defined State for buffer]
 Cache Miss Ratio:     47%       24625286       [Undefined State for Buffer]
 REAL Hit Ratio:       52%       27056475       [MRU/MFU Hits Only]

 Data Demand   Efficiency:    35%
 Data Prefetch Efficiency:    DISABLED (zfs_prefetch_disable)

CACHE HITS BY CACHE LIST:
  Anon:                        --%       Counter Rolled.
  Most Recently Used:          13%       3627289 (mru)       [ Return Customer ]
  Most Frequently Used:        86%       23429186 (mfu)      [ Frequent Customer ]
  Most Recently Used Ghost:    17%       4657584 (mru_ghost) [ Return Customer Evicted, Now Back ]
  Most Frequently Used Ghost:  32%       8712009 (mfu_ghost) [ Frequent Customer Evicted, Now Back ]
CACHE HITS BY DATA TYPE:
  Demand Data:                 30%       8308866
  Prefetch Data:                0%       0
  Demand Metadata:             69%       18747609
  Prefetch Metadata:            0%       0
CACHE MISSES BY DATA TYPE:
  Demand Data:                 61%       15113029
  Prefetch Data:                0%       0
  Demand Metadata:             38%       9511898
  Prefetch Metadata:            0%       359


[zfs-discuss] ZFS behavior under limited resources

2010-04-02 Thread Mike Z
I am trying to see how ZFS behaves under resource starvation - corner cases in 
embedded environments. I see some very strange behavior. Any help/explanation 
would really be appreciated.

My current setup is :
OpenSolaris 111b (iSCSI seems to be broken in 132 - unable to get multiple
connections/multipathing)
iSCSI storage array that is capable of:
20 MB/s random writes @ 4k and 70 MB/s random reads @ 4k
150 MB/s random writes @ 128k and 180 MB/s random reads @ 128k
180+ MB/s for sequential reads and writes at both 4k and 128k
8 Intel CPU and 12 GB of RAM (DELL poweredge 610)

The ARC size is limited to 512MB (hard limit). No L2 Cache.

In both tests below the file system size is about 300 GB. This file system
contains a single directory with about 15'000 files totalling 200 GB (so
the file system is 2/3 full). The tests are run within the same directory.

Test 1:
Random writes @ 4k to 1000 1MB files (1000 threads, 1 per file).

First I observe that the ARC size grows (momentarily) above the 512 MB limit
(via kstat and arcstat.pl).
Q: It seems that zfs:zfs_arc_max is not really a hard limit?

I tried setting primarycache to none, metadata and all. The I/O reported is
similar in the NONE and METADATA cases (17 MB/s), while when set to ALL, I/O is
3-4 times less (4-5 MB/s).
Q: Any explanation would be useful.

In this test I observe that back-end I/O is on average 132 MB/s for READs and
51 MB/s for WRITEs.
Q: Why is more read than written?

Test 2:
Random writes @ 4k to 10'000 1MB files (10'000 threads, 1 per file).

- ARC size now goes to 1 GB during the entire test (way above the hard limit)

- ::memstat reports that zfs grew from the original 430 MB to about 1.5 GB
Q: Does mdb memstat reporting include ARC?

Q: On the back end I see 170 MB/s reads and 0.5 MB/s writes -- what is happening
here?



SOME sample output ...

---
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     800933              3128   25%
ZFS File Data              394450              1540   13%
Anon                       128909               503    4%
Exec and libs                4172                16    0%
Page cache                  14749                57    0%
Free (cachelist)            21884                85    1%
Free (freelist)           1776079              6937   57%

Total                     3141176             12270
Physical                  3141175             12270

--
System Memory:
 Physical RAM:  12270 MB
 Free Memory :  6966 MB
 LotsFree:  191 MB

ZFS Tunables (/etc/system):
 set zfs:zfs_prefetch_disable = 1
 set zfs:zfs_arc_max = 0x2000
 set zfs:zfs_arc_min = 0x1000

ARC Size:
 Current Size: 669 MB (arcsize)
 Target Size (Adaptive):   512 MB (c)
 Min Size (Hard Limit):256 MB (zfs_arc_min)
 Max Size (Hard Limit):512 MB (zfs_arc_max)

ARC Size Breakdown:
 Most Recently Used Cache Size:   6%32 MB (p)
 Most Frequently Used Cache Size:93%480 MB (c-p)

ARC Efficency:
 Cache Access Total:             47002757
 Cache Hit Ratio:      52%       24657634       [Defined State for buffer]
 Cache Miss Ratio:     47%       22345123       [Undefined State for Buffer]
 REAL Hit Ratio:       52%       24657634       [MRU/MFU Hits Only]

 Data Demand   Efficiency:    36%
 Data Prefetch Efficiency:    DISABLED (zfs_prefetch_disable)

CACHE HITS BY CACHE LIST:
  Anon:                        --%       Counter Rolled.
  Most Recently Used:          13%       3420349 (mru)       [ Return Customer ]
  Most Frequently Used:        86%       21237285 (mfu)      [ Frequent Customer ]
  Most Recently Used Ghost:    16%       4057965 (mru_ghost) [ Return Customer Evicted, Now Back ]
  Most Frequently Used Ghost:  31%       7837353 (mfu_ghost) [ Frequent Customer Evicted, Now Back ]
CACHE HITS BY DATA TYPE:
  Demand Data:                 31%       7793822
  Prefetch Data:                0%       0
  Demand Metadata:             68%       16863812
  Prefetch Metadata:            0%       0
CACHE MISSES BY DATA TYPE:
  Demand Data:                 60%       13573358
  Prefetch Data:                0%       0
  Demand Metadata:             39%       8771406
  Prefetch Metadata:            0%       359


Re: [zfs-discuss] zfs diff

2010-03-29 Thread Mike Gerdts
On Mon, Mar 29, 2010 at 5:39 PM, Nicolas Williams
nicolas.willi...@sun.com wrote:
 One really good use for zfs diff would be: as a way to index zfs send
 backups by contents.

Or to generate the list of files for incremental backups via NetBackup
or similar.  This is especially important for file systems with
millions of files with relatively few changes.
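
Assuming the eventual interface looks the way it has been sketched in the
proposals (syntax guessed here), the backup software would only need to parse
output along the lines of:

# zfs diff tank/home@monday tank/home@tuesday
M       /tank/home/alice/report.odt
+       /tank/home/alice/new-data.csv
-       /tank/home/bob/scratch/tmpfile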

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-20 Thread Mike Gerdts
On Fri, Mar 19, 2010 at 11:57 PM, Edward Ned Harvey
solar...@nedharvey.com wrote:
 1. NDMP for putting zfs send streams on tape over the network.  So

 Tell me if I missed something here.  I don't think I did.  I think this
 sounds like crazy talk.

 I used NDMP up till November, when we replaced our NetApp with a Solaris Sun
 box.  In NDMP, to choose the source files, we had the ability to browse the
 fileserver, select files, and specify file matching patterns.  My point is:
 NDMP is file based.  It doesn't allow you to spawn a process and backup a
 data stream.

 Unless I missed something.  Which I doubt.  ;-)

5+ years ago the variety of NDMP that was available with the
combination of NetApp's OnTap and Veritas NetBackup did backups at the
volume level.  When I needed to go to tape to recover a file that was
no longer in snapshots, we had to find space on a NetApp to restore
the volume.  It could not restore the volume to a Sun box, presumably
because the contents of the backup used a data stream format that was
proprietary to NetApp.

An expired Internet Draft for NDMPv4 says:

  butype_name
     Specifies the name of the backup method to be used for the
     transfer (dump, tar, cpio, etc). Backup types are NDMP Server
     implementation dependent and MUST match one of the Data
     Server implementation specific butype_name strings accessible
     via the NDMP_CONFIG_GET_BUTYPE_INFO request.

http://www.ndmp.org/download/sdk_v4/draft-skardal-ndmp4-04.txt

It seems pretty clear from this that an NDMP data stream can contain
most anything and is dependent on the device being backed up.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-18 Thread Mike Gerdts
On Wed, Mar 17, 2010 at 9:15 AM, Edward Ned Harvey
solar...@nedharvey.com wrote:
 I think what you're saying is:  Why bother trying to backup with zfs
 send
 when the recommended practice, fully supportable, is to use other tools
 for
 backup, such as tar, star, Amanda, bacula, etc.   Right?

 The answer to this is very simple.
 #1  ...
 #2  ...

 Oh, one more thing.  zfs send is only discouraged if you plan to store the
 data stream and do zfs receive at a later date.

 If instead, you are doing zfs send | zfs receive onto removable media, or
 another server, where the data is immediately fed through zfs receive then
 it's an entirely viable backup technique.
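
A bare-bones version of that, with invented pool, dataset and device names:

zpool create backup c5t0d0                        # pool on the removable disk
zfs snapshot -r tank@2010-03-18
zfs send -R tank@2010-03-18 | zfs receive -Fd backup
zpool export backup                               # before pulling the disk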

Richard Elling made an interesting observation that suggests that
storing a zfs send data stream on tape is a quite reasonable thing to
do.  Richard's background makes me trust his analysis of this much
more than I trust the typical person that says that zfs send output is
poison.

http://opensolaris.org/jive/thread.jspa?messageID=465973&tstart=0#465861

I think that a similar argument could be made for storing the zfs send
data streams on a zfs file system.  However, it is not clear why you
would do this instead of just zfs send | zfs receive.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] [OT] excess zfs-discuss mailman digests

2010-02-08 Thread Mike Gerdts
On Mon, Feb 8, 2010 at 9:04 PM, grarpamp grarp...@gmail.com wrote:
 PS: Is there any way to get a copy of the list since inception
 for local client perusal, not via some online web interface?

You can get monthly .gz archives in mbox format from
http://mail.opensolaris.org/pipermail/zfs-discuss/.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zero out block / sectors

2010-01-25 Thread Mike Gerdts
On Mon, Jan 25, 2010 at 2:32 AM, Kjetil Torgrim Homme
kjeti...@linpro.no wrote:
 Mike Gerdts mger...@gmail.com writes:

 John Hoogerdijk wrote:
 Is there a way to zero out unused blocks in a pool?  I'm looking for
 ways to shrink the size of an opensolaris virtualbox VM and using the
 compact subcommand will remove zero'd sectors.

 I've long suspected that you should be able to just use mkfile or dd
 if=/dev/zero ... to create a file that consumes most of the free
 space then delete that file.  Certainly it is not an ideal solution,
 but seems quite likely to be effective.

 you'll need to (temporarily) enable compression for this to have an
 effect, AFAIK.

 (dedup will obviously work, too, if you dare try it.)

You are missing the point.  Compression and dedup will make it so that
the blocks in the devices are not overwritten with zeroes.  The goal
is to overwrite the blocks so that a back-end storage device or
back-end virtualization platform can recognize that the blocks are not
in use and as such can reclaim the space.
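
As a rough sketch of the overwrite-free-space idea (the mountpoint is a
placeholder; compression and dedup must be off on the dataset or the
zeroes never reach the disk, and dd will simply run until the file
system fills up unless you give it a count):

# cd /tank/fs
# dd if=/dev/zero of=zerofill bs=1024k
# sync
# rm zerofill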

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zero out block / sectors

2010-01-23 Thread Mike Gerdts
On Sat, Jan 23, 2010 at 11:55 AM, John Hoogerdijk
john.hoogerd...@sun.com wrote:
 Mike Gerdts wrote:

 On Fri, Jan 22, 2010 at 1:00 PM, John Hoogerdijk
 john.hoogerd...@sun.com wrote:


 Is there a way to zero out unused blocks in a pool?  I'm looking for ways
 to
 shrink the size of an opensolaris virtualbox VM and
 using the compact subcommand will remove zero'd sectors.


 I've long suspected that you should be able to just use mkfile or dd
 if=/dev/zero ... to create a file that consumes most of the free
 space then delete that file.  Certainly it is not an ideal solution,
 but seems quite likely to be effective.


 I tried this with mkfile - no joy.

Let me ask a couple of the questions that come just after "are you
sure your computer is plugged in?"

Did you wait enough time for the data to be flushed to disk (or do
sync and wait for it to complete) prior to removing the file?

You did "mkfile $huge /var/tmp/junk", not "mkfile -n $huge /var/tmp/junk", right?

If not, I suspect that zpool replace to a thin provisioned disk is
going to be your best bet (as suggested in another message).
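
If you go the replace route, the shape of it would be roughly the
following (pool and device names are placeholders; the new device is a
freshly created, never-written virtual disk attached to the VM):

# zpool replace rpool c7d0 c7d1
# zpool status rpool

Once the resilver finishes, the old virtual disk can be detached from
the VM and discarded, and the new image stays small because only live
data was ever written to it.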

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send/receive as backup - reliability?

2010-01-22 Thread Mike Gerdts
On Thu, Jan 21, 2010 at 11:28 AM, Richard Elling
richard.ell...@gmail.com wrote:
 On Jan 21, 2010, at 3:55 AM, Julian Regel wrote:
  Until you try to pick one up and put it in a fire safe!

 Then you backup to tape from x4540 whatever data you need.
 In case of enterprise products you save on licensing here as you need a one 
 client license per x4540 but in fact can backup data from many clients 
 which are there.

 Which brings up full circle...

 What do you then use to backup to tape bearing in mind that the Sun-provided 
 tools all have significant limitations?

 Poor choice of words.  Sun resells NetBackup and (IIRC) that which was
 formerly called NetWorker.  Thus, Sun does provide enterprise backup
 solutions.

(Symantec nee Veritas) NetBackup and (EMC nee Legato) Networker are
different products that compete in the enterprise backup space.

Under the covers NetBackup uses gnu tar to gather file data for the
backup stream.  At one point (maybe still the case), one of the
claimed features of netbackup is that if a tape is written without
multiplexing, you can use gnu tar to extract data.  This seems to be
most useful when you need to recover master and/or media servers and
to be able to extract your data after you no longer use netbackup.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zero out block / sectors

2010-01-22 Thread Mike Gerdts
On Fri, Jan 22, 2010 at 1:00 PM, John Hoogerdijk
john.hoogerd...@sun.com wrote:
 Is there a way to zero out unused blocks in a pool?  I'm looking for ways to
 shrink the size of an opensolaris virtualbox VM and
 using the compact subcommand will remove zero'd sectors.

I've long suspected that you should be able to just use mkfile or dd
if=/dev/zero ... to create a file that consumes most of the free
space then delete that file.  Certainly it is not an ideal solution,
but seems quite likely to be effective.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup memory overhead

2010-01-21 Thread Mike Gerdts
On Thu, Jan 21, 2010 at 2:51 PM, Andrey Kuzmin
andrey.v.kuz...@gmail.com wrote:
 Looking at dedupe code, I noticed that on-disk DDT entries are
 compressed less efficiently than possible: key is not compressed at
 all (I'd expect roughly 2:1 compression ration with sha256 data),

A cryptographic hash such as sha256 should not be compressible.  A
trivial example shows this to be the case:

for i in {1..1} ; do
echo $i | openssl dgst -sha256 -binary
done > /tmp/sha256

$ gzip -c sha256 > sha256.gz
$ compress -c sha256 > sha256.Z
$ bzip2 -c sha256 > sha256.bz2

$ ls -go sha256*
-rw-r--r--   1  32 Jan 22 04:13 sha256
-rw-r--r--   1  428411 Jan 22 04:14 sha256.Z
-rw-r--r--   1  321846 Jan 22 04:14 sha256.bz2
-rw-r--r--   1  320068 Jan 22 04:14 sha256.gz

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send/receive as backup - reliability?

2010-01-19 Thread Mike La Spina
I use zfs send/recv in the enterprise and in smaller environments all time and 
it's is excellent.

Have a look at how awesome the functionally is in this example.

http://blog.laspina.ca/ubiquitous/provisioning_disaster_recovery_with_zfs

Regards,

Mike
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send/receive as backup - reliability?

2010-01-16 Thread Mike Gerdts
On Sat, Jan 16, 2010 at 5:31 PM, Toby Thain t...@telegraphics.com.au wrote:
 On 16-Jan-10, at 7:30 AM, Edward Ned Harvey wrote:

 I am considering building a modest sized storage system with zfs. Some
 of the data on this is quite valuable, some small subset to be backed
 up forever, and I am evaluating back-up options with that in mind.

 You don't need to store the zfs send data stream on your backup media.
 This would be annoying for the reasons mentioned - some risk of being able
 to restore in future (although that's a pretty small risk) and inability
 to
 restore with any granularity, i.e. you have to restore the whole FS if you
 restore anything at all.

 A better approach would be zfs send and pipe directly to zfs receive
 on
 the external media.  This way, in the future, anything which can read ZFS
 can read the backup media, and you have granularity to restore either the
 whole FS, or individual things inside there.

 There have also been comments about the extreme fragility of the data stream
 compared to other archive formats. In general it is strongly discouraged for
 these purposes.


Yet it is used in ZFS flash archives on Solaris 10 and is slated for
use in the successor to flash archives.  This initial proposal seems
to imply using the same mechanism for a system image backup (instead
of just system provisioning).

http://mail.opensolaris.org/pipermail/caiman-discuss/2010-January/015909.html

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning

2010-01-08 Thread Mike Gerdts
On Fri, Jan 8, 2010 at 6:55 AM, Darren J Moffat darr...@opensolaris.org wrote:
 Frank Batschulat (Home) wrote:

 This just can't be an accident, there must be some coincidence and thus
 there's a good chance
 that these CHKSUM errors must have a common source, either in ZFS or in
 NFS ?

 What are you using for on the wire protection with NFS ?  Is it shared using
 krb5i or do you have IPsec configured ?  If not I'd recommend trying one of
 those and see if your symptoms change.

Shouldn't a scrub pick that up?  Why would there be no errors from
zoneadm install, which under the covers does a pkg image create
followed by *multiple* pkg install invocations.  No checksum errors
pop up there.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning

2010-01-08 Thread Mike Gerdts
On Fri, Jan 8, 2010 at 6:51 AM, James Carlson carls...@workingcode.com wrote:
 Frank Batschulat (Home) wrote:
 This just can't be an accident, there must be some coincidence and thus 
 there's a good chance
 that these CHKSUM errors must have a common source, either in ZFS or in NFS ?

 One possible cause would be a lack of substantial exercise.  The man
 page says:

         A regular file. The use of files as a backing  store  is
         strongly  discouraged.  It  is  designed  primarily  for
         experimental purposes, as the fault tolerance of a  file
         is  only  as  good  as  the file system of which it is a
         part. A file must be specified by a full path.

 Could it be that discouraged and experimental mean not tested as
 thoroughly as you might like, and certainly not a good idea in any sort
 of production environment?

 It sounds like a bug, sure, but the fix might be to remove the option.

This unsupported feature is supported with the use of Sun Ops Center
2.5 when a zone is put on a NAS Storage Library.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning

2010-01-08 Thread Mike Gerdts
On Fri, Jan 8, 2010 at 5:28 AM, Frank Batschulat (Home)
frank.batschu...@sun.com wrote:
[snip]
 Hey Mike, you're not the only victim of these strange CHKSUM errors, I hit
 the same during my slightely different testing, where I'm NFS mounting an
 entire, pre-existing remote file living in the zpool on the NFS server and use
 that to create a zpool and install zones into it.

What does your overall setup look like?

Mine is:

T5220 + Sun System Firmware 7.2.4.f 2009/11/05 18:21
   Primary LDom
  Solaris 10u8
  Logical Domains Manager 1.2,REV=2009.06.25.09.48 + 142840-03
  Guest Domain 4 vcpus + 15 GB memory
 OpenSolaris snv_130
(this is where the problem is observed)

I've seen similar errors on Solaris 10 in the primary domain and on a
M4000.  Unfortunately Solaris 10 doesn't show the checksums in the
ereport.  There I noticed a mixture between read errors and checksum
errors - and lots more of them.  This could be because the S10 zone
was a full root SUNWCXall compared to the much smaller default ipkg
branded zone.  On the primary domain running Solaris 10...

(this command was run some time ago)
primary-domain# zpool status myzone
  pool: myzone
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
myzone  DEGRADED 0 0 0
  /foo/20g  DEGRADED 4.53K 0   671  too many errors

errors: No known data errors


(this was run today, many days after previous command)
primary-domain# fmdump -eV | egrep zio_err | uniq -c | head
      1  zio_err = 5
      1  zio_err = 50
      1  zio_err = 5
      1  zio_err = 50
      1  zio_err = 5
      1  zio_err = 50
      2  zio_err = 5
      1  zio_err = 50
      3  zio_err = 5
      1  zio_err = 50

Note that even though I had thousands of read errors the zone worked
just fine. I would have never known (suspected?) there was a problem
if I hadn't run zpool status or the various FMA commands.


 I've filed today:

 6915265 zpools on files (over NFS) accumulate CKSUM errors with no apparent 
 reason

Thanks.  I'll open a support call to help get some funding on it...

 here's the relevant piece worth investigating out of it (leaving out the 
 actual setup etc..)
 as in your case, creating the zpool and installing the zone into it still 
 gives
 a healthy zpool, but immediately after booting the zone, the zpool served 
 over NFS
 accumulated CHKSUM errors.

 of particular interest are the 'cksum_actual' values as reported by Mike for 
 his
 test case here:

 http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg33041.html

 if compared to the 'chksum_actual' values I got in the fmdump error output on 
 my test case/system:

 note, the NFS servers zpool that is serving and sharing the file we use is 
 healthy.

 zone halted now on my test system, and checking fmdump:

 osoldev.batschul./export/home/batschul.= fmdump -eV | grep cksum_actual | 
 sort | uniq -c | sort -n | tail
   2    cksum_actual = 0x4bea1a77300 0xf6decb1097980 0x217874c80a8d9100 
 0x7cd81ca72df5ccc0
   2    cksum_actual = 0x5c1c805253 0x26fa7270d8d2 0xda52e2079fd74 
 0x3d2827dd7ee4f21
   6    cksum_actual = 0x28e08467900 0x479d57f76fc80 0x53bca4db5209300 
 0x983ddbb8c4590e40
 *A   6    cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 
 0x89715e34fbf9cdc0
 *B   7    cksum_actual = 0x0 0x0 0x0 0x0
 *C  11    cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 
 0x280934efa6d20f40
 *D  14    cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 
 0x7e0aef335f0c7f00
 *E  17    cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 
 0xd4f1025a8e66fe00
 *F  20    cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 
 0x7f84b11b3fc7f80
 *G  25    cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 
 0x82804bc6ebcfc0

 osoldev.root./export/home/batschul.= zpool status -v
  pool: nfszone
  state: DEGRADED
 status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
 action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scrub: none requested
 config:

        NAME        STATE     READ WRITE CKSUM
        nfszone     DEGRADED     0     0     0
          /nfszone  DEGRADED     0     0   462  too many errors

 errors: No known data errors

 ==

 now compare this with Mike's error output as posted here:

 http://www.mail-archive.com/zfs

Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning

2010-01-08 Thread Mike Gerdts
On Fri, Jan 8, 2010 at 9:11 AM, Mike Gerdts mger...@gmail.com wrote:
 I've seen similar errors on Solaris 10 in the primary domain and on a
 M4000.  Unfortunately Solaris 10 doesn't show the checksums in the
 ereport.  There I noticed a mixture between read errors and checksum
 errors - and lots more of them.  This could be because the S10 zone
 was a full root SUNWCXall compared to the much smaller default ipkg
 branded zone.  On the primary domain running Solaris 10...

I've written a dtrace script to get the checksums on Solaris 10.
Here's what I see with NFSv3 on Solaris 10.

# zoneadm -z zone1 halt ; zpool export pool1 ; zpool import -d
/mnt/pool1 pool1 ; zoneadm -z zone1 boot ; sleep 30 ; pkill dtrace

# ./zfs_bad_cksum.d
Tracing...
dtrace: error on enabled probe ID 9 (ID 43443:
fbt:zfs:zio_checksum_error:return): invalid address (0x301b363a000) in
action #4 at DIF offset 20
dtrace: error on enabled probe ID 9 (ID 43443:
fbt:zfs:zio_checksum_error:return): invalid address (0x3037f746000) in
action #4 at DIF offset 20
cccdtrace:
error on enabled probe ID 9 (ID 43443:
fbt:zfs:zio_checksum_error:return): invalid address (0x3026e7b) in
action #4 at DIF offset 20
cc
Checksum errors:
   3 : 0x130e01011103 0x20108 0x0 0x400 (fletcher_4_native)
   3 : 0x220125cd8000 0x62425980c08 0x16630c08296c490c
0x82b320c082aef0c (fletcher_4_native)
   3 : 0x2f2a0a202a20436f 0x7079726967687420 0x2863292032303031
0x2062792053756e20 (fletcher_4_native)
   3 : 0x3c21444f43545950 0x452048544d4c2050 0x55424c494320222d
0x2f2f5733432f2f44 (fletcher_4_native)
   3 : 0x6005a8389144 0xc2080e6405c200b6 0x960093d40800
0x9eea007b9800019c (fletcher_4_native)
   3 : 0xac044a6903d00163 0xa138c8003446 0x3f2cd1e100b10009
0xa37af9b5ef166104 (fletcher_4_native)
   3 : 0xbaddcafebaddcafe 0xc 0x0 0x0 (fletcher_4_native)
   3 : 0xc4025608801500ff 0x1018500704528210 0x190103e50066
0xc34b90001238f900 (fletcher_4_native)
   3 : 0xfe00fc01fc42fc42 0xfc42fc42fc42fc42 0xfffc42fc42fc42fc
0x42fc42fc42fc42fc (fletcher_4_native)
   4 : 0x4b2a460a 0x0 0x4b2a460a 0x0 (fletcher_4_native)
   4 : 0xc00589b159a00 0x543008a05b673 0x124b60078d5be
0xe3002b2a0b605fb3 (fletcher_4_native)
   4 : 0x130e010111 0x32000b301080034 0x10166cb34125410
0xb30c19ca9e0c0860 (fletcher_4_native)
   4 : 0x130e010111 0x3a201080038 0x104381285501102
0x418016996320408 (fletcher_4_native)
   4 : 0x130e010111 0x3a201080038 0x1043812c5501102
0x81802325c080864 (fletcher_4_native)
   4 : 0x130e010111 0x3a0001c01080038 0x1383812c550111c
0x818975698080864 (fletcher_4_native)
   4 : 0x1f81442e9241000 0x2002560880154c00 0xff10185007528210
0x19010003e566 (fletcher_4_native)
   5 : 0xbab10c 0xf 0x53ae 0xdd549ae39aa1ba20 (fletcher_4_native)
   5 : 0x130e010111 0x3ab01080038 0x1163812c550110b
0x8180a7793080864 (fletcher_4_native)
   5 : 0x61626300 0x0 0x0 0x0 (fletcher_4_native)
   5 : 0x8003 0x3df0d6a1 0x0 0x0 (fletcher_4_native)
   6 : 0xbab10c 0xf 0x5384 0xdd549ae39aa1ba20 (fletcher_4_native)
   7 : 0xbab10c 0xf 0x0 0x9af5e5f61ca2e28e (fletcher_4_native)
   7 : 0x130e010111 0x3a201080038 0x104381265501102
0xc18c7210c086006 (fletcher_4_native)
   7 : 0x275c222074650a2e 0x5c222020436f7079 0x7269676874203139
0x38392041540a2e5c (fletcher_4_native)
   8 : 0x130e010111 0x3a0003101080038 0x1623812c5501131
0x8187f66a4080864 (fletcher_4_native)
   9 : 0x8a000801010c0682 0x2eed0809c1640513 0x70200ff00026424
0x18001d16101f0059 (fletcher_4_native)
  12 : 0xbab10c 0xf 0x0 0x45a9e1fc57ca2aa8 (fletcher_4_native)
  30 : 0xbaddcafebaddcafe 0xbaddcafebaddcafe 0xbaddcafebaddcafe
0xbaddcafebaddcafe (fletcher_4_native)
  47 : 0x0 0x0 0x0 0x0 (fletcher_4_native)
  92 : 0x130e01011103 0x10108 0x0 0x200 (fletcher_4_native)

Since I had to guess at what the Solaris 10 source looks like, some
extra eyeballs on the dtrace script are in order.
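
For anyone who wants to poke at this without the attachment, the heart
of it can be approximated with a one-liner (a sketch only, not the
attached script; it just counts non-zero returns from the function
named in the errors above):

# dtrace -n 'fbt:zfs:zio_checksum_error:return /arg1 != 0/ { @[arg1] = count(); }'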

Mike

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


zfs_bad_cksum.d
Description: Binary data
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning

2010-01-08 Thread Mike Gerdts
On Fri, Jan 8, 2010 at 12:28 PM, Torrey McMahon tmcmah...@yahoo.com wrote:
 On 1/8/2010 10:04 AM, James Carlson wrote:

 Mike Gerdts wrote:


 This unsupported feature is supported with the use of Sun Ops Center
 2.5 when a zone is put on a NAS Storage Library.


 Ah, ok.  I didn't know that.



 Does anyone know how that works? I can't find it in the docs, no one inside
 of Sun seemed to have a clue when I asked around, etc. RTFM gladly taken.

Storage libraries are discussed very briefly at:

http://wikis.sun.com/display/OC2dot5/Storage+Libraries

Creation of zones is discussed at:

http://wikis.sun.com/display/OC2dot5/Creating+Zones

I've found no documentation that explains the implementation details.
From looking at a test environment that I have running, it seems to go
like:

1. The storage admin carves out some NFS space and exports it with the
appropriate options to the  various hosts (global zones).

2. In the Ops Center BUI, the ops center admin creates a new storage
library.  He selects type NFS and specifies the hostname and path that
was allocated.

3. The ops center admin associates the storage library with various
hosts.  This causes it to be be mounted at
/var/mnt/virtlibs/libraryId on those hosts.  I'll call this $libmnt.

4. When the sysadmin provisions a zone through ops center, a UUID is
allocated and associated with this zone.  I'll call it $zuuid.  A
directory $libmnt/$zuuid is created with a set of directories under
it.

5. As the sysadmin provisions the zone, Ops Center prompts for the virtual
disk size.  A file of that size is created at $libmnt/$zuuid/virtdisk/data.

6. Ops center creates a zpool:

zpool create -m /var/mnt/oc-zpools/$zuuid/ z$zuuid \
 $libmnt/$zuuid/virtdisk/data

7. The zonepath is created using a uuid that is unique to the zonepath
($puuid): z$zuuid/$puuid.  It has a quota and a reservation set (8G
each in the zpool history I am looking at).

8. The zone is configured with
zonepath=/var/mnt/oc-zpools/$zuuid/$puuid, then installed

Just in case anyone sees this as the right way to do things, I think
it is generally OK with a couple caveats. The key areas that I would
suggest for improvement are:

- Mount the NFS space with -o forcedirectio (see the example after this
list).  There is no need to cache data twice.
- Never use UUID's in paths.  This makes it nearly impossible for a
sysadmin or a support person to look at the output of commands on the
system and understand what it is doing.
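
For the first point, something along these lines is what I have in mind
(server name and export path are placeholders):

# mount -F nfs -o vers=4,forcedirectio filer:/export/oczones \
    /var/mnt/virtlibs/libraryId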

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zones on shared storage - a warning

2010-01-07 Thread Mike Gerdts
[removed zones-discuss after sending heads-up that the conversation
will continue at zfs-discuss]

On Mon, Jan 4, 2010 at 5:16 PM, Cindy Swearingen
cindy.swearin...@sun.com wrote:
 Hi Mike,

 It is difficult to comment on the root cause of this failure since
 the several interactions of these features are unknown. You might
 consider seeing how Ed's proposal plays out and let him do some more
 testing...

Unfortunately Ed's proposal is not funded last I heard.  Ops Center
uses many of the same mechanisms for putting zones on ZFS.  This is
where I saw the problem initially.

 If you are interested in testing this with NFSv4 and it still fails
 the same way, then also consider testing this with a local file
 instead of a NFS-mounted file and let us know the results. I'm also
 unsure of using the same path for the pool and the zone root path,
 rather than one path for pool and a pool/dataset path for zone
 root path. I will test this myself if I get some time.

I have been unable to reproduce with a local file.  I have been able
to reproduce with NFSv4 on build 130.  Rather surprisingly the actual
checksums found in the ereports are sometimes 0x0 0x0 0x0 0x0 or
0xbaddcafe00 

Here's what I did:

- Install OpenSolaris build 130 (ldom on T5220)
- Mount some NFS space at /nfszone:
   mount -F nfs -o vers=4 $file:/path /nfszone
- Create a 10gig sparse file
   cd /nfszone
   mkfile -n 10g root
- Create a zpool
   zpool create -m /zones/nfszone nfszone /nfszone/root
- Configure and install a zone
   zonecfg -z nfszone
set zonepath = /zones/nfszone
set autoboot = false
verify
commit
exit
   chmod 700 /zones/nfszone
   zoneadm -z nfszone install

- Verify that the nfszone pool is clean.  First, pkg history in the
zone shows the timestamp of the last package operation

  2010-01-07T20:27:07 install   pkg Succeeded

At 20:31 I ran:

# zpool status nfszone
  pool: nfszone
 state: ONLINE
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
nfszone  ONLINE   0 0 0
  /nfszone/root  ONLINE   0 0 0

errors: No known data errors

I booted the zone.  By 20:32 it had accumulated 132 checksum errors:

 # zpool status nfszone
  pool: nfszone
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
nfszone  DEGRADED 0 0 0
  /nfszone/root  DEGRADED 0 0   132  too many errors

errors: No known data errors

fmdump has some very interesting things to say about the actual
checksums.  The 0x0 and 0xbaddcafe00 seem to shout that these checksum
errors are not due to a couple bits flipped

# fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail
   2cksum_actual = 0x14c538b06b6 0x2bb571a06ddb0 0x3e05a7c4ac90c62
0x290cbce13fc59dce
   3cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400
0x7e0aef335f0c7f00
   3cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800
0xd4f1025a8e66fe00
   4cksum_actual = 0x0 0x0 0x0 0x0
   4cksum_actual = 0x1d32a7b7b00 0x248deaf977d80 0x1e8ea26c8a2e900
0x330107da7c4bcec0
   5cksum_actual = 0x14b8f7afe6 0x915db8d7f87 0x205dc7979ad73
0x4e0b3a8747b8a8
   6cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00
0x280934efa6d20f40
   6cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00
0x89715e34fbf9cdc0
  16cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00
0x7f84b11b3fc7f80
  48cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500
0x82804bc6ebcfc0

I halted the zone, exported the pool, imported the pool, then did a
scrub.  Everything seemed to be OK:

# zpool export nfszone
# zpool import -d /nfszone nfszone
# zpool status nfszone
  pool: nfszone
 state: ONLINE
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
nfszone  ONLINE   0 0 0
  /nfszone/root  ONLINE   0 0 0

errors: No known data errors
# zpool scrub nfszone
# zpool status nfszone
  pool: nfszone
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Thu Jan  7 21:56:47 2010
config:

NAME STATE READ WRITE CKSUM
nfszone  ONLINE   0 0 0
  /nfszone/root  ONLINE   0 0 0

errors: No known data errors

But then I booted the zone...

# zoneadm -z nfszone boot
# zpool status nfszone
  pool: nfszone
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected

Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Mike Gerdts
On Tue, Jan 5, 2010 at 4:34 AM, Mikko Lammi mikko.la...@lmmz.net wrote:
 Hello,

 As a result of one badly designed application running loose for some time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

 Traditional approaches like find ./ -exec rm {} \; seem to take forever
 - after running several days, the directory size still says the same. The
 only way how I've been able to remove something has been by giving rm
 -rf to problematic directory from parent level. Running this command
 shows directory size decreasing by 10,000 files/hour, but this would still
 mean close to ten months (over 250 days) to delete everything!

 I also tried to use unlink command to directory as a root, as a user who
 created the directory, by changing directory's owner to root and so forth,
 but all attempts gave Not owner error.

 Any commands like ls -f or find will run for hours (or days) without
 actually listing anything from the directory, so I'm beginning to suspect
 that maybe the directory's data structure is somewhat damaged. Is there
 some diagnostics that I can run with e.g zdb to investigate and
 hopefully fix for a single directory within zfs dataset?

In situations like this, ls will be exceptionally slow partially
because it will sort the output.  Find is slow because it needs to
call lstat() on every entry.  In similar situations I have found the
following to work.

perl -e 'opendir(D, "."); while ( $d = readdir(D) ) { print "$d\n" }'

Replace print with unlink if you wish...
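
For example, the unlink variant might look like this (skipping . and ..):

perl -e 'opendir(D, "."); while ( defined($d = readdir(D)) ) { next if $d eq "." || $d eq ".."; unlink($d) }'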


 To make things even more difficult, this directory is located in rootfs,
 so dropping the zfs filesystem would basically mean reinstalling the
 entire system, which is something that we really wouldn't wish to go.


 OS is Solaris 10, zpool version is 10 (rather old, I know, but is there
 easy path for upgrade that might solve this problem?) and the zpool
 consists two 146 GB SAS drivers in a mirror setup.


 Any help would be appreciated.

 Thanks,
 Mikko

 --
  Mikko Lammi | l...@lmmz.net | http://www.lmmz.net

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool creation best practices

2010-01-03 Thread Mike
Thanks for the response Marion.  I'm glad that I'm not the only one. :)

Message was edited by: mijohnst
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Mike Gerdts
On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote:

 Devzero,

 Unfortunately that was my assumption as well. I don't have source level
 knowledge of ZFS, though based on what I know it wouldn't be an easy way to
 do it. I'm not even sure it's only a technical question, but a design
 question, which would make it even less feasible.

 It is not hard, because ZFS knows the current free list, so walking that
 list
 and telling the storage about the freed blocks isn't very hard.

 What is hard is figuring out if this would actually improve life.  The
 reason
 I say this is because people like to use snapshots and clones on ZFS.
 If you keep snapshots, then you aren't freeing blocks, so the free list
 doesn't grow. This is a very different use case than UFS, as an example.

It seems as though the oft mentioned block rewrite capabilities needed
for pool shrinking and changing things like compression, encryption,
and deduplication would also show benefit here.  That is, blocks would
be re-written in such a way to minimize the number of chunks of
storage that is allocated.  The current HDS chunk size is 42 MB.

The most benefit would seem to be to have ZFS make a point of reusing
old but freed blocks before doing an allocation that causes the
back-end storage to allocate another chunk of disk to the
thin-provisioned LUN.  While it is important to be able to roll back a few
transactions in the event of some widely discussed failure modes, it
is probably reasonable to reuse a block freed by a txg that is 3,000
txg's old (about 1 day old if 1 txg per 30 seconds).  Such a threshold
could be used to determine whether to reuse a block or venture into
previously untouched regions of the disk.

This strategy would allow the SAN administrator (who is a different
person than the sysadmin) to allocate extra space to servers and the
sysadmin can control the amount of space really used by quotas.  In
the event that there is an emergency need for more space, the sysadmin
can increase the quota and allow more of the allocate SAN space to be
used.  Assuming the block rewrite feature comes to fruition, this
emergency growth could be shrunk back down to the original size once
the surge in demand (or errant process) subsides.
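
In command terms the sysadmin's side of that could be as simple as the
following sketch (dataset name and sizes are made up):

# zfs set quota=400g sanpool/data

...and, when the emergency hits, loosen it:

# zfs set quota=500g sanpool/data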


 There are a few minor bumps in the road. The ATA PASSTHROUGH
 command, which allows TRIM to pass through the SATA drivers, was
 just integrated into b130. This will be more important to small servers
 than SANs, but the point is that all parts of the software stack need to
 support the effort. As such, it is not clear to me who, if anyone, inside
 Sun is champion for the effort -- it crosses multiple organizational
 boundaries.


 Apart from the technical possibilities, this feature looks really
 inevitable to me in the long run especially for enterprise customers with
 high-end SAN as cost is always a major factor in a storage design and it's a
 huge difference if you have to pay based on the space used vs space
 allocated (for example).

 If the high cost of SAN storage is the problem, then I think there are
 better ways to solve that :-)

The SAN could be an OpenSolaris device serving LUNs through COMSTAR.
 If those LUNs are used to hold a zpool, the zpool could notify the
LUN that blocks are no longer used and the SAN could reclaim those
blocks.  This is just a variant of the same problem faced with
expensive SAN devices that have thin provisioning allocation units
measured in the tens of megabytes instead of hundreds to thousands of
kilobytes.
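
A minimal sketch of that setup on the OpenSolaris side (names and sizes
are placeholders; -s keeps the zvol sparse so backing blocks are only
allocated as they are written):

# zfs create -s -V 100g tank/lun0
# sbdadm create-lu /dev/zvol/rdsk/tank/lun0
# stmfadm add-view 600144f0...   (the GUID that sbdadm prints)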

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Mike Gerdts
On Wed, Dec 30, 2009 at 3:12 PM, Richard Elling
richard.ell...@gmail.com wrote:
 If the allocator can change, what sorts of policies should be
 implemented?  Examples include:
        + should the allocator stick with best-fit and encourage more
           gangs when the vdev is virtual?
        + should the allocator be aware of an SSD's page size?  Is
           said page size available to an OS?
        + should the metaslab boundaries align with virtual storage
           or SSD page boundaries?

Wandering off topic a little bit...

Should the block size be a tunable, so that it can match the page size of
SSDs (typically 4K, right?) and upcoming hard disks that sport a sector
size > 512 bytes?

http://arc.opensolaris.org/caselog/PSARC/2008/769/final_spec.txt

 And, perhaps most important, how can this be done automatically
 so that system administrators don't have to be rocket scientists
 to make a good choice?

Didn't you read the marketing literature?  ZFS is easy because you
only need to know two commands: zpool and zfs.  If you just ignore all
the subcommands, options to those subcommands, evil tuning that is
sometimes needed, and effects of redundancy choices then there is no
need for any rocket scientists.  :)

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Changing ZFS drive pathing

2009-12-30 Thread Mike
Just thought I would let you all know that I followed what Alex suggested along 
with what many of you pointed out and it worked! Here are the steps I followed:

1. Break root drive mirror
2. zpool export filesystem
3. run the command to enable MPxIO (stmsboot -e) and reboot the machine (see the sketch after this list)
4. zpool import filesystem
5. Check the system
6. Recreate the mirror.
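
For anyone following along, the command-level version of steps 2-4 was
roughly (pool name as above):

# zpool export filesystem
# stmsboot -e
(reboot when stmsboot asks)
# zpool import filesystem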

Thank you all for the help!  I feel much better and it worked without a single 
problem!  I'm very impressed with MPXIO and wish I had known about it before 
spending thousands of dollars on PowerPath.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Zpool creation best practices

2009-12-30 Thread Mike
I'm just wondering what some of you might do with your systems.  

We have an EMC Clariion unit that I connect several Sun machines to.  I allow 
the EMC to do its hardware raid5 for several LUNs and then I stripe them 
together.  I considered using raidz and just configuring the EMC as a JBOD, but 
I thought it would defeat the purpose of paying so much for a system with an 
advanced redundancy system.  I also like to add LUNs on the fly when a system 
needs more file space, and I know you can't do that with raidz.
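
(When I do that it's just a one-liner; the device name below is a placeholder 
for whatever the new LUN shows up as.)

# zpool add myzfs c4t5d0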

I've never had a LUN go bad, but bad things do happen.  Does anyone else use ZFS 
in this way?  Is this an unrecommended setup?  It's too late to change my 
setup, but in the future when I'm planning new systems, should I consider the 
effort to let ZFS fully control all the disks?

Message was edited by: mijohnst
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Zones on shared storage - a warning

2009-12-22 Thread Mike Gerdts
I've been playing around with zones on NFS a bit and have run into
what looks to be a pretty bad snag - ZFS keeps seeing read and/or
checksum errors.  This exists with S10u8 and OpenSolaris dev build
snv_129.  This is likely a blocker for anything thinking of
implementing parts of Ed's Zones on Shared Storage:

http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss

The OpenSolaris example appears below.  The order of events is:

1) Create a file on NFS, turn it into a zpool
2) Configure a zone with the pool as zonepath
3) Install the zone, verify that the pool is healthy
4) Boot the zone, observe that the pool is sick

r...@soltrain19# mount filer:/path /mnt
r...@soltrain19# cd /mnt
r...@soltrain19# mkdir osolzone
r...@soltrain19# mkfile -n 8g root
r...@soltrain19# zpool create -m /zones/osol osol /mnt/osolzone/root
r...@soltrain19# zonecfg -z osol
osol: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:osol create
zonecfg:osol info
zonename: osol
zonepath:
brand: ipkg
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: shared
hostid:
zonecfg:osol set zonepath=/zones/osol
zonecfg:osol set autoboot=false
zonecfg:osol verify
zonecfg:osol commit
zonecfg:osol exit

r...@soltrain19# chmod 700 /zones/osol

r...@soltrain19# zoneadm -z osol install
   Publisher: Using opensolaris.org (http://pkg.opensolaris.org/dev/
http://pkg-na-2.opensolaris.org/dev/).
   Publisher: Using contrib (http://pkg.opensolaris.org/contrib/).
   Image: Preparing at /zones/osol/root.
   Cache: Using /var/pkg/download.
Sanity Check: Looking for 'entire' incorporation.
  Installing: Core System (output follows)
DOWNLOAD  PKGS   FILESXFER (MB)
Completed46/46 12334/1233493.1/93.1

PHASEACTIONS
Install Phase18277/18277
No updates necessary for this image.
  Installing: Additional Packages (output follows)
DOWNLOAD  PKGS   FILESXFER (MB)
Completed36/36   3339/333921.3/21.3

PHASEACTIONS
Install Phase  4466/4466

Note: Man pages can be obtained by installing SUNWman
 Postinstall: Copying SMF seed repository ... done.
 Postinstall: Applying workarounds.
Done: Installation completed in 2139.186 seconds.

  Next Steps: Boot the zone, then log into the zone console (zlogin -C)
  to complete the configuration process.
6.3 Boot the OpenSolaris zone
r...@soltrain19# zpool status osol
  pool: osol
 state: ONLINE
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
osol  ONLINE   0 0 0
  /mnt/osolzone/root  ONLINE   0 0 0

errors: No known data errors

r...@soltrain19# zoneadm -z osol boot

r...@soltrain19# zpool status osol
  pool: osol
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
osol  DEGRADED 0 0 0
  /mnt/osolzone/root  DEGRADED 0 0   117  too many errors

errors: No known data errors

r...@soltrain19# zlogin osol uptime
  5:31pm  up 1 min(s),  0 users,  load average: 0.69, 0.38, 0.52


-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zones on shared storage - a warning

2009-12-22 Thread Mike Gerdts
On Tue, Dec 22, 2009 at 8:02 PM, Mike Gerdts mger...@gmail.com wrote:
 I've been playing around with zones on NFS a bit and have run into
 what looks to be a pretty bad snag - ZFS keeps seeing read and/or
 checksum errors.  This exists with S10u8 and OpenSolaris dev build
 snv_129.  This is likely a blocker for anything thinking of
 implementing parts of Ed's Zones on Shared Storage:

 http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss

 The OpenSolaris example appears below.  The order of events is:

 1) Create a file on NFS, turn it into a zpool
 2) Configure a zone with the pool as zonepath
 3) Install the zone, verify that the pool is healthy
 4) Boot the zone, observe that the pool is sick
[snip]

An off list conversation and a bit of digging into other tests I have
done shows that this is likely limited to NFSv3.  I cannot say that
this problem has been seen with NFSv4.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compressratio vs. dedupratio

2009-12-14 Thread Mike Gerdts
On Mon, Dec 14, 2009 at 3:54 PM, Craig S. Bell cb...@standard.com wrote:
 I am also accustomed to seeing diluted properties such as compressratio.  
 IMHO it could be useful (or perhaps just familiar) to see a diluted dedup 
 ratio for the pool, or maybe see the size / percentage of data used to arrive 
 at dedupratio.

 As Jeff points out, there is enough data available to calculate this.  Would 
 it be meaningful enough to present a diluted ratio property?  IOW, would that 
 tell me anything than I don't get from simply using available as my fuel 
 gauge?

 This is probably a larger topic:  What additional statistics would be 
 genuinely useful to the admin when there is space interaction between 
 datasets.  As we have seen, some commands are less objective with dedup:

I was recently confused when doing mkfile (or was it dd if=/dev/zero
...) and found that even though blocks were compressed away to
nothing, the compressratio did not increase.  For example:

# perl -e 'print a x 1'  /test/a
# zfs get compressratio test
NAME  PROPERTY   VALUE  SOURCE
test  compressratio  7.87x  -

However if I put null characters into the same file:

# dd if=/dev/zero of=a bs=1 count=1
1+0 records in
1+0 records out
# zfs get compressratio test
NAME  PROPERTY   VALUE  SOURCE
test  compressratio  1.00x  -

I understand that a block is not allocated if it contains all zero's,
but that would seem to contribute to a higher compressratio rather
than a lower compressratio.

If I disable compression and enable dedup, does it count deduplicated
blocks of zeros toward the dedupratio?
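
A quick way to check would be something like this sketch (pool and
dataset names are made up, and it assumes a pool version new enough for
dedup):

# zfs create -o dedup=on -o compression=off tank/dtest
# dd if=/dev/zero of=/tank/dtest/zeros bs=1024k count=100
# sync
# zpool get dedupratio tank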

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS - how to determine which physical drive to replace

2009-12-12 Thread Mike Gerdts
On Sat, Dec 12, 2009 at 9:58 AM, Edward Ned Harvey
sola...@nedharvey.com wrote:
 I would suggest something like this:  While the system is still on, if the
 failed drive is at least writable *a little bit* … then you can “dd
 if=/dev/zero of=/dev/rdsk/FailedDiskDevice bs=1024 count=1024” … and then
 after the system is off, you could plug the drives into another system
 one-by-one, and read the first 1M, and see if it’s all zeros.   (Or instead
 of dd zero, you could echo some text onto the drive, or whatever you think
 is easiest.)


How about reading instead?

dd if=/dev/rdsk/$whatever of=/dev/null

If the failed disk generates I/O errors that prevent it from reading
at a rate that causes an LED to blink, you could read from all of the
good disks.  The one that doesn't blink is the broken one.
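
For example, a crude sketch that reads a bit from every disk so you can
watch the activity LEDs (the device glob is an assumption; adjust it for
your controller numbering):

for d in /dev/rdsk/c*t*d*s2 ; do
        echo $d
        dd if=$d of=/dev/null bs=1024k count=512
done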

You can also get the drive serial number with iostat -En:

$ iostat -En
c3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: Hitachi HTS5425 Revision:  Serial No: 080804BB6300HCG Size:
160.04GB 160039305216 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
...

That /should/ be printed on the disk somewhere.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Changing ZFS drive pathing

2009-12-10 Thread Mike Johnston
Thanks for the info Alexander... I will test this out.  I'm just wondering
what it's going to see after I install Power Path.  Since each drive will
have 4 paths, plus the Power Path...  after doing a zfs import how will I
force it to use a specific path?  Thanks again!  Good to know that this can
be done.

On Wed, Dec 9, 2009 at 5:16 AM, Alexander J. Maidak ajmai...@mchsi.comwrote:

 On Tue, 2009-12-08 at 09:15 -0800, Mike wrote:
  I had a system that I was testing zfs on using EMC Luns to create a
 striped zpool without using the multi-pathing software PowerPath.  Of coarse
 a storage emergency came up so I lent this storage out for temp storage and
 we're still using.  I'd like to add PowerPath to take advanage of the
 multi-pathing in case I lose and SFP (or entire switch for that matter) but
 I'm not exactly sure what I can do.
 
  So my zpool currently looks like:
 
 ...
 
  I would image (because I haven't tried it yet) that it would require
 using zfs export/import in order to make this happen.  Has anyone tried
 this?  Am I fubar?  Thanks for the help!  Great forum btw...

 When I've done this in the past its been a pool export, reconfigure
 storage, pool import procedure.

  In my dark PowerPath days I recall PowerPath couldn't handle using
  the emcpower# disk name.  You had to format the emcpower device and
  create your zpool on the emcpower1a slice.  This may have changed in
  the newer versions of PowerPath (the last version I ran was 5.0.0_b141).

  I've since switched to using Sun MPxIO for multipathing.  It's worked
  fine so far and it's certain to support your ZFS config.  Just export
  your pool, run the stmsboot -e command, reboot, and re-import
  your pool.

 -Alex






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Changing ZFS drive pathing

2009-12-09 Thread Mike
Alex, thanks for the info.  You made my heart stop a little when reading about 
your problem with PowerPath, but MPxIO seems like it might be a good option for 
me.  I will try that as well, although I have not used it before.  Thank you!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Changing ZFS drive pathing

2009-12-08 Thread Mike
I had a system that I was testing ZFS on, using EMC LUNs to create a striped 
zpool without using the multi-pathing software PowerPath.  Of course a storage 
emergency came up, so I lent this storage out for temp storage and we're still 
using it.  I'd like to add PowerPath to take advantage of the multi-pathing in 
case I lose an SFP (or an entire switch for that matter), but I'm not exactly 
sure what I can do.

So my zpool currently looks like:

##
# zpool status -v
  pool: myzfs
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
myzfs   ONLINE   0 0 0
  1234567890ONLINE   0 0 0
  1234567891ONLINE   0 0 0
###

So how would I change the path after I install PowerPath to use the multi-path? 
 So 1234567890  would be equal to /dev/dsk/emcpower1 and 1234567891 would be 
equal to /dev/dsk/emcpower2.

In the end it would look like:

##
# zpool status -v
  pool: myzfs
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
myzfs   ONLINE   0 0 0
  emcpower1  ONLINE   0 0 0
  emcpower2  ONLINE   0 0 0
###

I would imagine (because I haven't tried it yet) that it would require using zfs 
export/import in order to make this happen.  Has anyone tried this?  Am I 
fubar?  Thanks for the help!  Great forum btw...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Changing ZFS drive pathing

2009-12-08 Thread Mike
Thanks Cindys for your input...  I love your fear example too, but lucky for me 
I have 10 years before I have to worry about that and hopefully we'll all be in 
hovering bumper cars by then.

It looks like I'm going to have to create another test system and try the 
recommendations given here... and hope that another emergency doesn't arise... :)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Any way to remove a vdev

2009-12-02 Thread Mike Freeman
I'm sure it's been asked a thousand times, but is there any prospect of being
able to remove a vdev from a pool anytime soon?

Thanks!

-- 
Mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

