Re: [zfs-discuss] Pools inside pools

2010-09-23 Thread Mattias Pantzare
On Thu, Sep 23, 2010 at 08:48, Haudy Kazemi kaze0...@umn.edu wrote:
 Mattias Pantzare wrote:

 ZFS needs free memory for writes. If you fill your memory with dirty
 data zfs has to flush that data to disk. If that disk is a virtual
 disk in zfs on the same computer those writes need more memory from
 the same memory pool and you have a deadlock.
 If you write to a zvol on a different host (via iSCSI) those writes
 use memory in a different memory pool (on the other computer). No
 deadlock.

 Isn't this a matter of not keeping enough free memory as a workspace?  By
 free memory, I am referring to unallocated memory and also recoverable main
 memory used for shrinkable read caches (shrinkable by discarding cached
 data).  If the system keeps enough free and recoverable memory around for
 workspace, why should the deadlock case ever arise?  Slowness and page
 swapping might be expected to arise (as a result of a shrinking read cache
 and high memory pressure), but deadlocks too?

Yes. But what is enough reserved free memory? If you need 1 MB for a normal
configuration, you might need 2 MB when you are doing ZFS on ZFS. (I am just
guessing.)

This is the same problem as mounting an NFS server on itself via NFS. Also
not supported.

The system has shrinkable caches and so on, but that space will sometimes
run out, all of it. There is also swap to use, but if that swap is itself on ZFS
you are back to the same problem...

These things are also very hard to test.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pools inside pools

2010-09-22 Thread Mattias Pantzare
On Wed, Sep 22, 2010 at 20:15, Markus Kovero markus.kov...@nebula.fi wrote:


 Such configuration was known to cause deadlocks. Even if it works now (which 
 I don't expect to be the case) it will make your data to be cached twice. 
 The CPU utilization  will also be much higher, etc.
 All in all I strongly recommend against such setup.

 --
 Pawel Jakub Dawidek                       http://www.wheelsystems.com
 p...@freebsd.org                           http://www.FreeBSD.org
 FreeBSD committer                         Am I Evil? Yes, I Am!

 Well, CPU utilization can be tuned downwards by disabling checksums in inner 
 pools as checksumming is done in main pool. I'd be interested in bug id's for 
 deadlock issues and everything related. Caching twice is not an issue, 
 prefetching could be and it can be disabled
 I don't understand what makes it difficult for zfs to handle this kind of 
 setup. Main pool (testpool) should just allow any writes/reads to/from 
 volume, not caring what they are, where as anotherpool would just work as any 
 other pool consisting of any other devices.
 This is quite similar setup to iscsi-replicated mirror pool, where you have 
 redundant pool created from iscsi volumes locally and remotely.

ZFS needs free memory for writes. If you fill your memory with dirty
data zfs has to flush that data to disk. If that disk is a virtual
disk in zfs on the same computer those writes need more memory from
the same memory pool and you have a deadlock.
If you write to a zvol on a different host (via iSCSI) those writes
use memory in a different memory pool (on the other computer). No
deadlock.
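For concreteness, the local nesting that can deadlock, versus the two-host
variant that cannot, looks roughly like this (a sketch with hypothetical pool,
volume and device names; shareiscsi was the property of that era):

  # all on one host: the inner pool's writes need memory from the very pool they are flushing
  zpool create testpool c0t1d0
  zfs create -V 50g testpool/innervol
  zpool create anotherpool /dev/zvol/dsk/testpool/innervol

  # across two hosts: export the zvol over iSCSI and build the inner pool on host B instead
  zfs set shareiscsi=on testpool/innervol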
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-08 Thread Mattias Pantzare
On Wed, Sep 8, 2010 at 06:59, Edward Ned Harvey sh...@nedharvey.com wrote:
 On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey sh...@nedharvey.com
 wrote:

 I think the value you can take from this is:
 Why does the BPG say that?  What is the reasoning behind it?

 Anything that is a rule of thumb either has reasoning behind it (you
 should know the reasoning) or it doesn't (you should ignore the rule of
 thumb, dismiss it as myth.)

 Let's examine the myth that you should limit the number of drives in a vdev
 because of resilver time.  The myth goes something like this:  You shouldn't
 use more than ___ drives in a vdev raidz_ configuration, because all the
 drives need to read during a resilver, so the more drives are present, the
 longer the resilver time.

 The truth of the matter is:  Only the size of used data is read.  Because
 this is ZFS, it's smarter than a hardware solution which would have to read
 all disks in their entirety.  In ZFS, if you have a 6-disk raidz1 with
 capacity of 5 disks, and a total of 50G of data, then each disk has roughly
 10G of data in it.  During resilver, 5 disks will each read 10G of data, and
 10G of data will be written to the new disk.  If you have a 11-disk raidz1
 with capacity of 10 disks, then each disk has roughly 5G of data.  10 disks
 will each read 5G of data, and 5G of data will be written to the new disk.
 If anything, more disks means a faster resilver, because you're more easily
 able to saturate the bus, and you have a smaller amount of data that needs
 to be written to the replaced disk.

It is not a question of a vdev with 6 disks vs a vdev with 12 disks. It
is about 1 vdev with 12 disks vs 2 vdevs with 6 disks each. With 2
vdevs you have to read half the data, compared to 1 vdev, to resilver a
disk.

Or look at it this way: you will put more data on a 12-disk vdev than
on a 6-disk vdev.

I/O other than the resilver will also slow the resilver down more if
you have large vdevs.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-08 Thread Mattias Pantzare
On Wed, Sep 8, 2010 at 15:27, Edward Ned Harvey sh...@nedharvey.com wrote:
 From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of
 Mattias Pantzare

 It
 is about 1 vdev with 12 disk or  2 vdev with 6 disks. If you have 2
 vdev you have to read half the data compared to 1 vdev to resilver a
 disk.

 Let's suppose you have 1T of data.  You have 12-disk raidz2.  So you have
 approx 100G on each disk, and you replace one disk.  Then 11 disks will each
 read 100G, and the new disk will write 100G.

 Let's suppose you have 1T of data.  You have 2 vdev's that are each 6-disk
 raidz1.  Then we'll estimate 500G is on each vdev, so each disk has approx
 100G.  You replace a disk.  Then 5 disks will each read 100G, and 1 disk
 will write 100G.

 Both of the above situations resilver in equal time, unless there is a bus
 bottleneck.  21 disks in a single raidz3 will resilver just as fast as 7
 disks in a raidz1, as long as you are avoiding the bus bottleneck.  But 21
 disks in a single raidz3 provides better redundancy than 3 vdev's each
 containing a 7 disk raidz1.

 In my personal experience, approx 5 disks can max out approx 1 bus.  (It
 actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks
 on a good bus, or good disks on a crap bus, but generally speaking people
 don't do that.  Generally people get a good bus for good disks, and cheap
 disks for crap bus, so approx 5 disks max out approx 1 bus.)

 In my personal experience, servers are generally built with a separate bus
 for approx every 5-7 disk slots.  So what it really comes down to is ...

 Instead of the Best Practices Guide saying Don't put more than ___ disks
 into a single vdev, the BPG should say Avoid the bus bandwidth bottleneck
 by constructing your vdev's using physical disks which are distributed
 across multiple buses, as necessary per the speed of your disks and buses.

This is assuming that you have no other I/O besides the scrub.

You should of course keep the number of disks in a vdev low for
general performance reasons, unless you only do sequential reads (the
random IOPS of a whole raidz vdev are close to what a single disk can give).
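As a concrete illustration (hypothetical device names, just a sketch), the two
layouts being compared would be created like this; the second gives roughly
twice the random IOPS and half the data per vdev to resilver:

  # one wide vdev
  zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
      c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0

  # or instead: two narrower vdevs striped in the same pool
  zpool create tank raidz1 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
      raidz1 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0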
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs lists discrepancy after added a new vdev to pool

2010-08-28 Thread Mattias Pantzare
On Sat, Aug 28, 2010 at 02:54, Darin Perusich
darin.perus...@cognigencorp.com wrote:
 Hello All,

 I'm sure this has been discussed previously but I haven't been able to find an
 answer to this. I've added another raidz1 vdev to an existing storage pool and
 the increased available storage isn't reflected in the 'zfs list' output. Why
 is this?

 The system in question is runnning Solaris 10 5/09 s10s_u7wos_08, kernel
 Generic_139555-08. The system does not have the lastest patches which might be
 the cure.

 Thanks!

 Here's what I'm seeing.
 zpool create datapool raidz1 c1t50060E800042AA70d0  c1t50060E800042AA70d1

 zpool status
  pool: datapool
  state: ONLINE
  scrub: none requested
 config:

        NAME                       STATE     READ WRITE CKSUM
        datapool                   ONLINE       0     0     0
          raidz1                   ONLINE       0     0     0
            c1t50060E800042AA70d0  ONLINE       0     0     0
            c1t50060E800042AA70d1  ONLINE       0     0     0

 zfs list
 NAME       USED  AVAIL  REFER  MOUNTPOINT
 datapool   108K   196G    18K  /datapool

 zpool add datapool raidz1 c1t50060E800042AA70d2 c1t50060E800042AA70d3

 zpool status
  pool: datapool
  state: ONLINE
  scrub: none requested
 config:

        NAME                       STATE     READ WRITE CKSUM
        datapool                   ONLINE       0     0     0
          raidz1                   ONLINE       0     0     0
            c1t50060E800042AA70d0  ONLINE       0     0     0
            c1t50060E800042AA70d1  ONLINE       0     0     0
          raidz1                   ONLINE       0     0     0
            c1t50060E800042AA70d2  ONLINE       0     0     0
            c1t50060E800042AA70d3  ONLINE       0     0     0

 zfs list
 NAME       USED  AVAIL  REFER  MOUNTPOINT
 datapool   112K   392G    18K  /datapool

I think you have to explain your problem in more detail: 392G is more than
196G, so the new vdev's space does show up in the 'zfs list' output.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disk space overhead (total volume size) by ZFS

2010-05-30 Thread Mattias Pantzare
On Sun, May 30, 2010 at 23:37, Sandon Van Ness san...@van-ness.com wrote:
 I just wanted to make sure this is normal and is expected. I fully
 expected that as the file-system filled up I would see more disk space
 being used than with other file-systems due to its features but what I
 didn't expect was to lose out on ~500-600GB to be missing from the total
 volume size right at file-system creation.

 Comparing two systems, one being JFS and one being ZFS, one being raidz2
 one being raid6. Here is the differences I see:

 ZFS:
 r...@opensolaris: 11:22 AM :/data# df -k /data
 Filesystem            kbytes    used   avail capacity  Mounted on
 data                 17024716800 258872352 16765843815     2%    /data

 JFS:
 r...@sabayonx86-64: 11:22 AM :~# df -k /data2
 Filesystem           1K-blocks      Used Available Use% Mounted on
 /dev/sdd1            17577451416   2147912 17575303504   1% /data2

 zpool list shows the raw capacity right?

 r...@opensolaris: 11:25 AM :/data# zpool list data
 NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
 data   18.1T   278G  17.9T     1%  1.00x  ONLINE  -

 Ok, i would expect it to be rounded to 18.2 but that seems about right
 for 20 trillion bytes (what 20x1 TB is):

 r...@sabayonx86-64: 11:23 AM :~# echo | awk '{print
 20000000000000/1024/1024/1024/1024}'
 18.1899

 Now minus two drives for parity:

 r...@sabayonx86-64: 11:23 AM :~# echo | awk '{print
 18000000000000/1024/1024/1024/1024}'
 16.3709

 Yet when running zfs list it also lists the amount of storage
 significantly smaller:

 r...@opensolaris: 11:23 AM :~# zfs list data
 NAME   USED  AVAIL  REFER  MOUNTPOINT
 data   164K  15.9T  56.0K  /data

 I would expect this to be 16.4T.

 Taking the df -k values JFS gives me a total volume size of:

 r...@sabayonx86-64: 11:31 AM :~# echo | awk '{print
 17577451416/1024/1024/1024}'
 16.3703

 and zfs is:

 r...@sabayonx86-64: 11:31 AM :~# echo | awk '{print
 17024716800/1024/1024/1024}'
 15.8555

 So basically with JFS I see no decrease in total volume size but a huge
 difference on ZFS. Is this normal/expected? Can anything be disabled to
 not lose 500-600 GB of space?

This may be the answer:
http://www.cuddletech.com/blog/pivot/entry.php?id=1013
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reverse lookup: inode to name lookup

2010-05-01 Thread Mattias Pantzare
On Sat, May 1, 2010 at 16:23,  casper@sun.com wrote:


I understand you cannot lookup names by inode number in general, because
that would present a security violation.  Joe User should not be able to
find the name of an item that's in a directory where he does not have
permission.



But, even if it can only be run by root, is there some way to lookup the
name of an object based on inode number?

 Sure, that's typically how NFS works.

 The inode itself is not sufficient; an inode number might be recycled and
 and old snapshot with the same inode number may refer to a different file.

No, an NFS client will not ask the NFS server for a name by sending the
inode number or NFS handle. There is no need for an NFS client to do that.

There is no way to get a name from an inode number.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reverse lookup: inode to name lookup

2010-05-01 Thread Mattias Pantzare
On Sat, May 1, 2010 at 16:49,  casper@sun.com wrote:


No, a NFS client will not ask the NFS server for a name by sending the
inode or NFS-handle. There is no need for a NFS client to do that.

 The NFS clients certainly version 2 and 3 only use the file handle;
 the file handle can be decoded by the server.  It filehandle does not
 contain the name, only the FSid, the inode number and the generation.


There is no way to get a name from an inode number.

 The nfs server knows how so it is clearly possible.  It is not exported to
 userland but the kernel can find a file by its inumber.

The NFS server can find the file, but not the file _name_.

The inode number is all that the NFS server needs; it does not need the
file name if it has the inode number.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reverse lookup: inode to name lookup

2010-05-01 Thread Mattias Pantzare
 If the kernel (or root) can open an arbitrary directory by inode number,
 then the kernel (or root) can find the inode number of its parent by looking
 at the '..' entry, which the kernel (or root) can then open, and identify
 both:  the name of the child subdir whose inode number is already known, and
 (b) yet another '..' entry.  The kernel (or root) can repeat this process
 recursively, up to the root of the filesystem tree.  At that time, the
 kernel (or root) has completely identified the absolute path of the inode
 that it started with.

 The only question I want answered right now is:

 Although it is possible, is it implemented?  Is there any kind of function,
 or existing program, which can be run by root, to obtain either the complete
 path of a directory by inode number, or to simply open an inode by number,
 which would leave the recursion and absolute path generation yet to be
 completed?

You can do it in the kernel by calling vnodetopath(). I don't know if it
is exposed to user space.

But that could be slow if you have large directories, so you have to
think about where you would use it.
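From user space, the closest equivalent (and it is slow, since it walks the
whole tree) is to let find match on the inode number; a sketch with a
hypothetical mount point and inode number:

  # as root, walk one filesystem looking for the object with inode 12345
  find /export/home -inum 12345 -print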
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Secure delete?

2010-04-12 Thread Mattias Pantzare
 OpenSolaris needs support for the TRIM command for SSDs.  This command is
 issued to an SSD to indicate that a block is no longer in use and the SSD
 may erase it in preparation for future writes.

 There does not seem to be very much `need' since there are other ways that a
 SSD can know that a block is no longer in use so it can be erased.  In fact,
 ZFS already uses an algorithm (COW) which is friendly for SSDs.

What ways would that be?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Secure delete?

2010-04-12 Thread Mattias Pantzare
On Mon, Apr 12, 2010 at 19:19, David Magda dma...@ee.ryerson.ca wrote:
 On Mon, April 12, 2010 12:28, Tomas Ögren wrote:
 On 12 April, 2010 - David Magda sent me these 0,7K bytes:

 On Mon, April 12, 2010 10:48, Tomas Ögren wrote:

  For flash to overwrite a block, it needs to clear it first.. so yes,
  clearing it out in the background (after erasing) instead of just
  before the timing critical write(), you can make stuff go faster.

 Except that ZFS does not overwrite blocks because it is copy-on-write.

 So CoW will enable infinite storage, so you never have to write on the
 same place again? Cool.

 Your comment was regarding making write()s go faster by pre-clearing
 unused blocks so there's always writable blocks available. Because ZFS
 doesn't go to the same LBAs when writing data, the SSD doesn't have to
 worry about read-modify-write circumstances like it has to with
 traditional file systems.

 Given that ZFS probably would not have to go back to old blocks until
 it's reached the end of the disk, that should give the SSDs' firmware
 plenty of time to do block-remapping and background erasing--something
 that's done now anyway regardless of whether an SSD supports TRIM or not.
 You don't need TRIM to make ZFS go fast, though it doesn't hurt.

Why would the disk care whether the block was written recently? There
is old data on it that has to be preserved anyway. The SSD does not
know whether the old data was important.

ZFS will eventually overwrite old blocks just as any other filesystem does.

The only thing that makes ZFS SSD-friendly is that it tries to issue
large writes. But that only works if you have few synchronous writes.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Mattias Pantzare
On Fri, Apr 2, 2010 at 16:24, Edward Ned Harvey solar...@nedharvey.com wrote:
 The purpose of the ZIL is to act like a fast log for synchronous
 writes.  It allows the system to quickly confirm a synchronous write
 request with the minimum amount of work.

 Bob and Casper and some others clearly know a lot here.  But I'm hearing
 conflicting information, and don't know what to believe.  Does anyone here
 work on ZFS as an actual ZFS developer for Sun/Oracle?  Can claim I can
 answer this question, I wrote that code, or at least have read it?

 Questions to answer would be:

 Is a ZIL log device used only by sync() and fsync() system calls?  Is it
 ever used to accelerate async writes?

sync() will tell the filesystems to flush writes to disk. sync() will
not use the ZIL; it will just start a new TXG, and it may return before the
writes are done.

fsync() is what you are interested in.


 Suppose there is an application which sometimes does sync writes, and
 sometimes async writes.  In fact, to make it easier, suppose two processes
 open two files, one of which always writes asynchronously, and one of which
 always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
 for writes to be committed to disk out-of-order?  Meaning, can a large block
 async write be put into a TXG and committed to disk before a small sync
 write to a different file is committed to disk, even though the small sync
 write was issued by the application before the large async write?  Remember,
 the point is:  ZIL is disabled.  Question is whether the async could
 possibly be committed to disk before the sync.


Writes from a TXG will not be used until the whole TXG is committed to disk.
Everything from a half-written TXG will be ignored after a crash.

This means that the order of writes within a TXG is not important.

The only way to do a sync write without the ZIL is to start a new TXG
after the write. That costs a lot, so we have the ZIL for sync writes.
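For reference, the "suppose the ZIL is disabled" scenario corresponded, on
builds of that era, to the old zil_disable tunable; a sketch, assuming your
build still honors it (later builds replaced it with a per-dataset sync
property):

  # /etc/system, takes effect at next boot; it only changes the sync-write path,
  # async writes never went through the ZIL in the first place
  set zfs:zil_disable = 1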
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect

2010-03-10 Thread Mattias Pantzare
 These days I am a fan for forward check access lists, because any one who
 owns a DNS server can say that for IPAddressX returns aserver.google.com.
 They can not set the forward lookup outside of their domain  but they can
 setup a reverse lookup. The other advantage is forword looking access lists
 is you can use DNS Alias in access lists as well.

That is not true; you have to have a valid A record in the correct domain.

This is how it works (and how you should check your reverse lookups in
your applications):

1. Do a reverse lookup.
2. Do a forward lookup of the name from step 1.
3. Check that the IP address is one of the addresses you got in step 2.

Ignore the reverse lookup if the check in step 3 fails.
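A quick way to walk through those steps by hand, using getent and a
hypothetical client address and name:

  getent hosts 192.0.2.10           # step 1: reverse lookup, prints e.g. "192.0.2.10  client.example.com"
  getent hosts client.example.com   # step 2: forward lookup of that name
  # step 3: accept the name only if 192.0.2.10 appears among the addresses printed in step 2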
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is there something like udev in OpenSolaris

2010-02-20 Thread Mattias Pantzare
On Sat, Feb 20, 2010 at 11:14, Lutz Schumann
presa...@storageconcepts.de wrote:
 Hello list,

 beeing a Linux Guy I'm actually quite new to Opensolaris. One thing I miss is 
 udev.
 I found that when using SATA disks with ZFS - it always required manual
 intervention (cfgadm) to do SATA hot plug.

 I would like to automate the disk replacement, so that it is a fully automatic
 process without manual intervention if:

 a) the new disk contains no ZFS labels
 b) the new disk does not contain a partition table

 .. thus it is a real replacement part

 On Linux I would write a udev hot plug script to automate this.

 Is there something like udev on OpenSolaris ?
 (A place / hook that is executed every time new hardware is added / detected)

Have you tried setting the autoreplace property to on for your pool?
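That is (hypothetical pool name):

  zpool set autoreplace=on tank
  zpool get autoreplace tank    # verify; with it on, a new device found in the same slot is used automatically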
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is ZFS internal reservation excessive?

2010-01-18 Thread Mattias Pantzare
 Ext2/3 uses 5% by default for root's usage; 8% under FreeBSD for FFS.
 Solaris (10) uses a bit more nuance for its UFS:

 That reservation is to preclude users to exhaust diskspace in such a way
 that ever root can not login and solve the problem.

No, the reservation in UFS/FFS is there to keep performance up: it gets
harder and harder to find free space as the disk fills. It is even
more important for ZFS to be able to find free space, since all writes
(being copy-on-write) need free space.

The root-only access is just a side effect.
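For comparison, the UFS/FFS reservation (minfree) can be inspected and tuned;
a sketch with a hypothetical slice:

  fstyp -v /dev/rdsk/c0t0d0s6 | grep minfree   # show the current reserved percentage
  tunefs -m 5 /dev/rdsk/c0t0d0s6               # trade some of that space for allocator performance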
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Repeating scrub does random fixes

2010-01-10 Thread Mattias Pantzare
On Sun, Jan 10, 2010 at 16:40, Gary Gendel g...@genashor.com wrote:
 I've been using a 5-disk raidZ for years on SXCE machine which I converted to 
 OSOL.  The only time I ever had zfs problems in SXCE was with snv_120, which 
 was fixed.

 So, now I'm at OSOL snv_111b and I'm finding that scrub repairs errors on 
 random disks.  If I repeat the scrub, it will fix errors on other disks.  
 Occasionally it runs cleanly.  That it doesn't happen in a consistent manner 
 makes me believe it's not hardware related.


That is actually a good indication of hardware-related errors. Software
will fail the same way every time, but hardware errors are often random.

That said, you are running an older build now; I would recommend an upgrade.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Mattias Pantzare
On Wed, Dec 30, 2009 at 19:23, roland devz...@web.de wrote:
 making transactional,logging filesystems thin-provisioning aware should be 
 hard to do, as every new and every changed block is written to a new location.
 so what applies to zfs, should also apply to btrfs or nilfs or similar 
 filesystems.

If that were a problem, it would also be a problem for UFS whenever you
write new files...

ZFS knows which blocks are free, and that is all you need to send to the disk system.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Mattias Pantzare
On Tue, Dec 29, 2009 at 18:16, Brad bene...@yahoo.com wrote:
 @eric

 As a general rule of thumb, each vdev has the random performance
 roughly the same as a single member of that vdev. Having six RAIDZ
 vdevs in a pool should give roughly the performance as a stripe of six
 bare drives, for random IO.

 It sounds like we'll need 16 vdevs striped in a pool to at least get the 
 performance of 15 drives plus another 16 mirrored for redundancy.

 If we are bounded in iops by the vdev, would it make sense to go with the 
 bare minimum of drives (3) per vdev?

The minimum is 1 drive per vdev; the minimum with redundancy is 2, using
mirroring. You should use mirroring to get the best performance.

 This winds up looking similar to RAID10 in layout, in that you're
 striping across a lot of disks that each consists of a mirror, though
 the checksumming rules are different. Performance should also be
 similar, though it's possible RAID10 may give slightly better random
 read performance at the expense of some data quality guarantees, since
 I don't believe RAID10 normally validates checksums on returned data
 if the device didn't return an error. In normal practice, RAID10 and
 a pool of mirrored vdevs should benchmark against each other within
 your margin of error.

 That's interesting to know that with ZFS's implementation of raid10 it 
 doesn't have checksumming built-in.

He was talking about RAID10, not mirroring in ZFS. ZFS will always use
checksums.
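A pool built as a stripe of mirrors, the layout being recommended here, would
look something like this (hypothetical device names); random-read IOPS scale
roughly with the number of vdevs:

  zpool create fastpool mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0 mirror c1t2d0 c2t2d0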
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Moving a pool from FreeBSD 8.0 to opensolaris

2009-12-24 Thread Mattias Pantzare
On Thu, Dec 24, 2009 at 04:36, Ian Collins i...@ianshome.com wrote:
 Mattias Pantzare wrote:

 I'm not sure how to go about it.  Basically, how should i format my
 drives in FreeBSD, create a ZPOOL which can be imported into
 OpenSolaris.


 I'm not sure about BSD, but Solaris ZFS works with whole devices.  So
 there isn't any OS specific formatting involved.  I assume BSD does the
 same.


 That is not true. ZFS will use a EFI partition table with one
 partition if you give it the whole disk.



 An EFI label isn't OS specific formatting!

It is. Not all OSes will read an EFI label.

A whole device on BSD really is the whole device: no partition table at all.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Moving a pool from FreeBSD 8.0 to opensolaris

2009-12-24 Thread Mattias Pantzare
 An EFI label isn't OS specific formatting!

 It is. Not all OS will read an EFI label.

 You misunderstood the concept of OS specific, I feel. EFI is indeed OS
 independent; however, that doesn't necesssarily imply that all OSs can read
 EFI disks. My Commodore 128D could boot CP/M but couldn't understand FAT32 -
 that doesn't mean that therefore FAT32 isn't OS independent either.

The PC partition table is also not OS specific; it is OS independent. Most
partition table formats are OS independent; they are often dictated by how
you boot the platform.

On a PC, EFI is in practice very OS specific, as most OSes on that platform
do not support EFI labels.

EFI is platform independent for Solaris. Solaris on SPARC and Solaris
on a PC use different partition tables unless you use EFI, as EFI is
supported by Solaris on both SPARC and x86. This is mainly a Solaris problem,
as there is no reason for Solaris on SPARC not to read a PC partition
table.

The reason that EFI is used by default for ZFS is that it is platform
independent (and that it can handle bigger disks on SPARC). Unless you
have to boot from it; then it is very platform dependent...

But the point was that ZFS on Solaris has to have a partition table,
so you must make a partition table on FreeBSD that Solaris can read.
It does not matter whether the format is OS specific or not.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Moving a pool from FreeBSD 8.0 to opensolaris

2009-12-23 Thread Mattias Pantzare
 I'm not sure how to go about it.  Basically, how should i format my
 drives in FreeBSD, create a ZPOOL which can be imported into OpenSolaris.

 I'm not sure about BSD, but Solaris ZFS works with whole devices.  So there 
 isn't any OS specific formatting involved.  I assume BSD does the same.

That is not true. ZFS will use an EFI partition table with one
partition if you give it the whole disk.

My guess is that you should put it in an EFI partition, but a normal
partition should work.

I would test this in VirtualBox or VMware if I were you.
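The test could look roughly like this (a sketch, untested; it assumes FreeBSD
8's gpart and that Solaris will taste the GPT partition on import):

  # on FreeBSD: label the disk with GPT and put the pool in a partition
  gpart create -s gpt da0
  gpart add -t freebsd-zfs da0        # creates da0p1
  zpool create tank da0p1
  zpool export tank

  # move the disk (or the virtual disk image) to the OpenSolaris guest, then:
  zpool import tank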
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI with Deduplication, is there any point?

2009-12-20 Thread Mattias Pantzare
 I have already run into one little snag that I don't see any way of 
 overcoming with my chosen method.  I've upgraded to snv_129 with high hopes 
 for getting the most out of deduplication.  But using iSCSI volumes I'm not 
 sure how I can gain any benefit from it.  The volumes are a set size, Windows 
 sees those volumes as that size despite any sort of block level deduplication 
 or compression taking place on the other side of the iSCSI connection.  I 
 can't create volumes that add up to more than the original pool size from 
 what I can tell.  I can see the pool is saving space but it doesn't appear to 
 become available to zfs volumes.  Dedup being pretty new I haven't found much 
 on the subject online.

Create sparse volumes: use -s when you create a volume, or change the
reservation on your existing volumes. Search for sparse in the zfs man page.
And don't run out of space. :-)
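For example (hypothetical pool and volume names; on older volumes the property
may be reservation rather than refreservation):

  zfs create -s -V 500g tank/iscsivol        # new thin volume: no space reserved up front
  zfs set refreservation=none tank/oldvol    # drop the reservation on an existing volume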
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Something wrong with zfs mount

2009-12-14 Thread Mattias Pantzare
 Is there better solution to this problem, what if the machine crashes?


 Crashes are abnormal conditions. If it crashes you should fix the problem to
 avoid future crashes and probably you will need to clear the pool dir
 hierarchy prior to import the pool.

Are you serious? I really hope that you have nothing to do with OS
development, or database development for that matter.

A crash can be something as simple as the battery in a portable
computer running out of power.

A user should _not_ have to do anything manual to get the system going again.

We got rid of manual fsck many years ago. Let's not move back to the stone age!
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Boot Recovery after Motherboard Death

2009-12-12 Thread Mattias Pantzare
On Sat, Dec 12, 2009 at 18:08, Richard Elling richard.ell...@gmail.com wrote:
 On Dec 12, 2009, at 12:53 AM, dick hoogendijk wrote:

 On Sat, 2009-12-12 at 00:22 +, Moritz Willers wrote:

 The host identity had - of course - changed with the new motherboard
 and it no longer recognised the zpool as its own.  'zpool import -f
 rpool' to take ownership, reboot and it all worked no problem (which
 was amazing in itself as I had switched from AMD to Intel ...).

 Do I understand correctly if I read this as: OpenSolaris is able to
 switch between systems without reinstalling? Just a zfs import -f and
 everything runs? Wow, that would be an improvemment and would make
 things more like *BSD/linux.

 Solaris has been able to do that for 20+ years.  Why do you think
 it should be broken now?

Solaris has _not_ been able to do that for 20+ years. In fact, Sun has
always recommended a reinstall. You could do it if you really knew
how, but it was not easy.

If you switch between identical systems it will of course work fine
(before ZFS, that is; now you may have to import the pool on the new
system).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] moving files from one fs to another, splittin/merging

2009-09-24 Thread Mattias Pantzare
 Thanks for the info. Glad to hear it's in the works, too.

It is not in the works. If you look at the bug IDs in the bug database
you will find no indication of work done on them.



 Paul


 1:21pm, Mark J Musante wrote:

 On Thu, 24 Sep 2009, Paul Archer wrote:

 I may have missed something in the docs, but if I have a file in one FS,
 and want to move it to another FS (assuming both filesystems are on the same
 ZFS pool), is there a way to do it outside of the standard mv/cp/rsync
 commands?

 Not yet.  CR 6483179 covers this.

 On a related(?) note, is there a way to split an existing filesystem?

 Not yet.  CR 6400399 covers this.


 Regards,
 markm

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Real help

2009-09-21 Thread Mattias Pantzare
On Mon, Sep 21, 2009 at 13:34, David Magda dma...@ee.ryerson.ca wrote:
 On Sep 21, 2009, at 06:52, Chris Ridd wrote:

 Does zpool destroy prompt are you sure in any way? Some admin tools do
 (beadm destroy for example) but there's not a lot of consistency.

 No it doesn't, which I always found strange.

 Personally I always thought you should be queried for a zfs destroy, but
 add a -f option for things like scripts. Not sure if things can be changed
 now.

You can import a destroyed pool; you can find destroyed pools with zpool import -D.

But the problem in this case was not zpool destroy, it was zpool
create: zpool create will overwrite whatever was on the partition.
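That is (hypothetical pool name; -f may also be needed if the pool looks in use):

  zpool import -D          # list destroyed pools that are still recoverable
  zpool import -D tank     # bring one of them back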
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send older version?

2009-09-16 Thread Mattias Pantzare
On Wed, Sep 16, 2009 at 09:34, Erik Trimble erik.trim...@sun.com wrote:
 Carson Gaspar wrote:

 Erik Trimble wrote:
   I haven't see this specific problem, but it occurs to me thus:

 For the reverse of the original problem, where (say) I back up a 'zfs
 send' stream to tape, then later on, after upgrading my system, I want to
 get that stream back.

 Does 'zfs receive' support reading a version X stream and dumping it into
 a version X+N zfs filesystem?

 If not, frankly, that's a higher priority than the reverse.

 Your question confuses me greatly - am I missing something? zfs recv of
 a full stream will create a new filesystem of the appropriate version, which
 you may then zfs upgrade if you wish. And restoring incrementals to a
 different fs rev doesn't make sense. As long as support for older fs
 versions isn't removed from the kernel, this shouldn't ever be a problem.

 You are correct in that restoring a full stream creates the appropriate
 versioned filesystem. That's not the problem.

 The /much/ more likely scenario is this:

 (1) Let's say I have a 2008.11 server.  I back up the various ZFS
 filesystems, with both incremental and full streams off to tape.

 (2) I now upgrade that machine to 2009.05, and upgrade all the zpool/zfs
 filesystems to the later versions, which is what most people will do.

 (3) Now, I need to get back a snapshot from before step #2.  I don't want a
 full stream recovery, just a little bit of data.  I now am in the situation
 that I have a current (active) ZFS filesystem which has a later version than
 the (incremental) stream I stored earlier.


 This is what a typical recover instance is.  If I can't recover an
 incremental into an existing filesystem, it effectively means my backups are
 lost and useless. (not quite true, but it creates a huge headache.)


Congratulations! You now know why you should use a backup program
instead of zfs send for your backups. (There are more reasons than
this)

zfs send streams are not designed for backups!

(But a backup program that understands zfs send streams and uses them
instead of recursing through the filesystem would be nice...)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send speed

2009-08-18 Thread Mattias Pantzare
On Tue, Aug 18, 2009 at 22:22, Paul Krauspk1...@gmail.com wrote:
 Posted from the wrong address the first time, sorry.

 Is the speed of a 'zfs send' dependant on file size / number of files ?

        We have a system with some large datasets (3.3 TB and about 35
 million files) and conventional backups take a long time (using
 Netbackup 6.5 a FULL takes between two and three days, differential
 incrementals, even with very few files changing, take between 15 and
 20 hours). We already use snapshots for day to day restores, but we
 need the 'real' backups for DR.

Conventional backups can be faster than that! I have not used
NetBackup, but you should be able to configure it to run several
backup streams in parallel. You may have to point NetBackup at subdirectories
instead of the filesystem root.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file change long - was zfs fragmentation

2009-08-12 Thread Mattias Pantzare
 It would be nice if ZFS had something similar to VxFS File Change Log.
 This feature is very useful for incremental backups and other
 directory walkers, providing they support FCL.

 I think this tangent deserves its own thread.  :)

 To save a trip to google...

 http://sfdoccentral.symantec.com/sf/5.0MP3/linux/manpages/vxfs/man1m/fcladm.html

 This functionality would come in very handy.  It would seem that it
 isn't too big of a deal to identify the files that changed, as this
 type of data is already presented via zpool status -v when
 corruption is detected.

 http://docs.sun.com/app/docs/doc/819-5461/gbctx?a=view

 In fact ZFS has a good transaction log, maybe the issue is there isn't
 software out there yet that uses it.

Where is that log? The ZIL does not log all transactions, and it is cleared
very quickly.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Mattias Pantzare
  Adding another pool and copying all/some data over to it would only
  a short term solution.

 I'll have to disagree.

 What is the point of a filesystem the can grow to such a huge size and
 not have functionality built in to optimize data layout?  Real world
 implementations of filesystems that are intended to live for
 years/decades need this functionality, don't they?

 Our mail system works well, only the backup doesn't perform well.
 All the features of ZFS that make reads perform well (prefetch, ARC)
 have little effect.

 We think backup is quite important. We do quite a few restores of months
 old data. Snapshots help in the short term, but for longer term restores
 we need to go to tape.

Your scalability problem may be in your backup solution.

The problem is not how many GB of data you have but the number of files.

It has been a while since I worked with Networker, so things may have changed.

If you are doing backups directly to tape you may have a buffering
problem. By simply staging backups on disk we got a lot faster
backups.

Have you configured Networker to do several simultaneous backups from
your pool?
You can do that by having several ZFS filesystems on the same pool, or by telling
Networker to do backups one directory level down so that it thinks you
have more file systems. And don't forget to play with the parallelism
settings in Networker.

This made a huge difference for us on VxFS.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs fragmentation

2009-08-08 Thread Mattias Pantzare
On Sat, Aug 8, 2009 at 20:20, Ed Spencered_spen...@umanitoba.ca wrote:

 On Sat, 2009-08-08 at 08:14, Mattias Pantzare wrote:

 Your scalability problem may be in your backup solution.
 We've eliminated the backup system as being involved with the
 performance issues.

 The servers are Solaris 10 with the OS on UFS filesystems. (In zfs
 terms, the pool is old/mature). Solaris has been patched to a fairly
 current level.

 Copying data from the zfs filesystem to the local ufs filesystem enjoys
 the same throughput as the backup system.

 The test was simple. Create a test filesystem on the zfs pool. Restore
 production email data to it. Reboot the server. Backup the data (29
 minutes for a 15.8 gig of data). Reboot the server. Copy data from zfs
 to ufs using a 'cp -pr ...' command, which also took 29 minutes.

Yes, that was expected. What happens if you run two cp -pr at the same
time? I am guessing that two cp runs will take almost the same time as one.

If you get twice the throughput from two cp runs, then you will get twice
the throughput from doing two backups in parallel.
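Something like this, with hypothetical paths, would show it; compare the
wall-clock time against a single copy of the same data:

  cp -pr /testfs/maildir1 /local-ufs/scratch1 &
  cp -pr /testfs/maildir2 /local-ufs/scratch2 &
  wait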
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Shrinking a zpool?

2009-08-06 Thread Mattias Pantzare
 If they accept virtualisation, why can't they use individual filesystems (or
 zvol) rather than pools?  What advantage do individual pools have over
 filesystems?  I'd have thought the main disadvantage of pools is storage
 flexibility requires pool shrink, something ZFS provides at the filesystem
 (or zvol) level.

You can move zpools between computers; you can't move individual file systems.

Remember that there is a SAN involved. The disk array does not run Solaris.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Shrinking a zpool?

2009-08-06 Thread Mattias Pantzare
On Thu, Aug 6, 2009 at 12:45, Ian Collinsi...@ianshome.com wrote:
 Mattias Pantzare wrote:

 If they accept virtualisation, why can't they use individual filesystems
 (or
 zvol) rather than pools?  What advantage do individual pools have over
 filesystems?  I'd have thought the main disadvantage of pools is storage
 flexibility requires pool shrink, something ZFS provides at the
 filesystem
 (or zvol) level.


 You can move zpools between computers, you can't move individual file
 systems.



 send/receive?

:-)
What is the downtime for doing a send/receive? What is the downtime
for zpool export, reconfigure the LUN, zpool import?

And you still need to shrink the pool.

Move a 100 GB application from server A to server B using send/receive
and you will have 100 GB stuck on server A that you can't use on server
B, where you really need it.
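The pool-move workflow being compared here is short (hypothetical pool name):

  zpool export apppool     # on server A
  # re-map the LUNs to server B on the array, then:
  zpool import apppool     # on server B, with all the pool's filesystems and space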
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Shrinking a zpool?

2009-08-06 Thread Mattias Pantzare
On Thu, Aug 6, 2009 at 16:59, Rossno-re...@opensolaris.org wrote:
 But why do you have to attach to a pool?  Surely you're just attaching to the 
 root
 filesystem anyway?  And as Richard says, since filesystems can be shrunk 
 easily
 and it's just as easy to detach a filesystem from one machine and attach to 
 it from
 another, why the emphasis on pools?

What filesystems are you talking about?
A zfs pool can be attached to one and only one computer at any given time.
All file systems in that pool are attached to the same computer.


 For once I'm beginning to side with Richard, I just don't understand why data 
 has to
 be in separate pools to do this.

All accounting for data and free blocks is done at the pool level.
That is why you can share space between file systems. You could write
code that made ZFS a cluster filesystem, maybe just at the pool level, but
that is a lot of work and would require all attached computers to talk
to each other.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] No files but pool is full?

2009-07-24 Thread Mattias Pantzare
On Fri, Jul 24, 2009 at 09:33, Markus Koveromarkus.kov...@nebula.fi wrote:
 During our tests we noticed very disturbing behavior, what would be causing
 this?

 System is running latest stable opensolaris.

 Any other means to remove ghost files rather than destroying pool and
 restoring from backups?

You may have snapshots, try:
zfs list -t snapshot
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] No files but pool is full?

2009-07-24 Thread Mattias Pantzare
On Fri, Jul 24, 2009 at 09:57, Markus Koveromarkus.kov...@nebula.fi wrote:
 r...@~# zfs list -t snapshot
 NAME                             USED  AVAIL  REFER  MOUNTPOINT
 rpool/ROOT/opensola...@install   146M      -  2.82G  -
 r...@~#

Then it is probably some process that has a deleted file open. You can
find those with:

fuser -c /testpool

But if you can't find the space after a reboot something is not right...


 -Original Message-
 From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of Mattias 
 Pantzare
 Sent: 24 July 2009 10:56
 To: Markus Kovero
 Cc: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] No files but pool is full?

 On Fri, Jul 24, 2009 at 09:33, Markus Koveromarkus.kov...@nebula.fi wrote:
 During our tests we noticed very disturbing behavior, what would be causing
 this?

 System is running latest stable opensolaris.

 Any other means to remove ghost files rather than destroying pool and
 restoring from backups?

 You may have snapshots, try:
 zfs list -t snapshot
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] The zfs performance decrease when enable the MPxIO round-robin

2009-07-19 Thread Mattias Pantzare
On Sun, Jul 19, 2009 at 08:25, lf yangno-re...@opensolaris.org wrote:
 Hi Guys
 I have a SunFire X4200M2 and the Xyratex RS1600 JBOD which I try to
 run the ZFS on it.But I found a problem:
 I set mpxio-disable=yes in the /kernel/drv/fp.conf to enable the MPxIO,

I assume you mean mpxio-disable=no


 and set load-balance=round-robin in the /kernel/drv/scsi_vhci.conf enable
 the round-robin.The ZFS performance is very low, it is about 50% performance
 decrease. If I disable the MPxIO or just set load-balance=none, the 
 performace
 is good to accept.

 I am confused. I google the website and found this 
 :http://xyratex.mobi/pdfs/products/storage-systems/tips/TIP107_Configuring_Solaris10_x86_for_Xyratex_storage_1-0.pdf

 It is the document of Xyratex, they have same result but no explain.Maybe it 
 is
 the SunMdi bug? How can I do this ?

That is probably a limitation in the hardware.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question about ZFS Incremental Send/Receive

2009-04-28 Thread Mattias Pantzare
 I feel like I understand what tar is doing, but I'm curious about what is it
 that ZFS is looking at that makes it a successful incremental send? That
 is, not send the entire file again. Does it have to do with how the
 application (tar in this example) does a file open, fopen(), and what mode
 is used? i.e. open for read, open for write, open for append. Or is it
 looking at a file system header, or checksum? I'm just trying to explain
 some observed behavior we're seeing during our testing.

 My proof of concept is to remote replicate these container files, which
 are created by a 3rd party application.

ZFS knows which blocks were written since the first snapshot was taken.

Filenames and the type of open are not important.

If you open a file and rewrite all blocks in that file with the same
content, all those blocks will be sent. If you rewrite 5 blocks, only 5
blocks are sent (plus the metadata that was updated).

The way it works is that all blocks have a timestamp. Blocks with a
timestamp newer than the first snapshot will be sent.
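In other words, an incremental stream between two snapshots carries only the
blocks born between them; roughly (hypothetical dataset and host names):

  zfs snapshot tank/containers@monday
  # ... the application rewrites a handful of blocks inside its container files ...
  zfs snapshot tank/containers@tuesday
  zfs send -i tank/containers@monday tank/containers@tuesday | \
      ssh remotehost zfs receive backup/containers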
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How recoverable is an 'unrecoverable error'?

2009-04-16 Thread Mattias Pantzare
On Thu, Apr 16, 2009 at 11:38, Uwe Dippel udip...@gmail.com wrote:
 On Thu, Apr 16, 2009 at 1:05 AM, Fajar A. Nugraha fa...@fajar.net wrote:

 [...]

 Thanks, Fajar, et al.

 What this thread actually shows, alas, is that ZFS is rocket science.
 In 2009, one would expect a file system to 'just work'. Why would
 anyone want to have to 'status' it regularly, in case 'scrub' it, and
 if scrub doesn't do the trick (and still not knowing how serious the
 'unrecoverable error' is - like in this case), 'clear' it, 'scrub'

You do not have to 'status' it regularly if you don't want to, just as
with any other file system. The difference is that you can, just as
you can and should do on the RAID system you would use with any other
file system.

If you do not have any problems, ZFS will just work. If you have
problems, ZFS will show them to you much better than ext3, FFS, UFS or
any other traditional filesystem, and will often fix them for you. In many
cases you would get corrupted data or have to run fsck for the same
error on FFS/UFS.

Scrub is much nicer than fsck; it is not easy to know the best answer
to the questions fsck will ask you if you have a serious metadata
problem on FFS/UFS. And yes, you can get into trouble even on OpenBSD.

You also have to look at the complexity of your volume manager, as ZFS
is both a filesystem and a volume manager in one.
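For completeness, the whole maintenance cycle under discussion is just
(hypothetical pool name):

  zpool scrub tank        # read and verify every allocated block in the background
  zpool status -v tank    # watch progress; -v lists files hit by unrecoverable errors
  zpool clear tank        # reset the error counters once the underlying cause is fixed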
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Notations in zpool status

2009-03-30 Thread Mattias Pantzare

 A useful way to obtain the mount point for a directory is with the
 df' command.  Just do 'df .' while in a directory to see where its
 filesystem mount point is:

 % df .
 Filesystem           1K-blocks      Used Available Use% Mounted on
 Sun_2540/home/bfriesen
                      119677846  65811409  53866437  55% /home/bfriesen


 Nice, I see by default it appears the gnu/bin is put ahead of /bin in
 $PATH, or maybe some my meddling did it, but I see running the Solaris
 df several more and confusing entries too:

 /system/contract   (ctfs              ):       0 blocks 2147483609 files

Add -h or -k to df:

df -h .
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpools on USB zpool.cache

2009-03-24 Thread Mattias Pantzare
 MP It would be nice to be able to move disks around when a system is
 MP powered off and not have to worry about a cache when I boot.

 You don't have to unless you are talking about share disks and
 importing a pool on another system while the original is powered off
 and the pool was not exported...

 For a configuration when disks are not shared among different systems
 you can move disk around without worrying about zpool.cache

So, what you are saying is that I can power off my computer, move my
ZFS disks to a different controller, power the computer back on, and
the ZFS file systems will show up?

zpool export is not always practical, especially on a root pool.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpools on USB zpool.cache

2009-03-23 Thread Mattias Pantzare
 I suggest ZFS at boot should (multi-threaded) scan every disk for ZFS
 disks, and import the ones with the correct host name and with a import flag
 set, without using the cache file. Maybe just use the cache file for non-EFI
 disk/partitions, but without the storing the pool name, but you should be
 able to tell ZFS to do a full scan which includes partition disk.

 Full scans are a bad thing, because they cannot scale. This is one
 good reason why zpool.cache exists.

What do you mean by cannot scale? Is it common to not use the majority
of disks available to a system?

If you taste all buses in parallel there should not be a scalability problem.


 What problem are you trying to solve?

It would be nice to be able to move disks around when a system is
powered off and not have to worry about a cache when I boot.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpools on USB zpool.cache

2009-03-23 Thread Mattias Pantzare
On Mon, Mar 23, 2009 at 22:15, Richard Elling richard.ell...@gmail.com wrote:
 Mattias Pantzare wrote:

 I suggest ZFS at boot should (multi-threaded) scan every disk for ZFS
 disks, and import the ones with the correct host name and with a import
 flag
 set, without using the cache file. Maybe just use the cache file for
 non-EFI
 disk/partitions, but without the storing the pool name, but you should
 be
 able to tell ZFS to do a full scan which includes partition disk.


 Full scans are a bad thing, because they cannot scale. This is one
 good reason why zpool.cache exists.


 What do you mean by cannot scale? Is it common to not use the majority
 of disks available to a system?


 No, it is uncommon.

So, what do you mean by cannot scale?


 If you taste all buses in parallel there should not be a scalability
 problem.


 Don't think busses, think networks.

 NB, busses are on the way out, most modern designs are point-to-point (SAS,
 SATA,
 USB) or networked (iSCSI, SAN, NAS).  Do you want to scan the internet for
 LUNs?

Do you know how a device is made available to ZFS, cache or no cache?

All buses have to be probed when you do a reconfigure boot or run devfsadm.

ZFS will only see the devices that you see in /dev.

If I can run zpool import in a reasonable amount of time, the cache is
not needed. Are there cases where I can't run zpool import?

 What problem are you trying to solve?


 It would be nice to be able to move disks around when a system is
 powered off and not have to worry about a cache when I boot.

 Why are you worrying about it?

If I put my disks on a different controller, ZFS won't find them when I
boot. That is bad. It is also an extra level of complexity.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpools on USB zpool.cache

2009-03-23 Thread Mattias Pantzare
On Tue, Mar 24, 2009 at 00:21, Tim t...@tcsac.net wrote:


 On Mon, Mar 23, 2009 at 4:45 PM, Mattias Pantzare pantz...@gmail.com
 wrote:


 If I put my disks on a diffrent controler zfs won't find them when I
 boot. That is bad. It is also an extra level of complexity.

 Correct me if I'm wrong, but wading through all of your comments, I believe
 what you would like to see is zfs automatically scan if the cache is invalid
 vs. requiring manual intervention, no?

That would be nice, but if there really is a problem with a scan, that
would not be good either, as it would trigger the very problem that the
cache is supposed to avoid.

But I don't understand why we need it in the first place, except as a
list of pools to import at boot.
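Without the cache, a boot-time import would amount to (hypothetical pool name):

  zpool import                   # taste the devices under /dev/dsk and list importable pools
  zpool import -d /dev/dsk tank  # import one, scanning an explicit device directory instead of the cache file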
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Nexsan SATABeast and ZFS

2009-03-10 Thread Mattias Pantzare
On Tue, Mar 10, 2009 at 23:57, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Tue, 10 Mar 2009, Moore, Joe wrote:

 As far as workload, any time you use RAIDZ[2], ZFS must read the entire
 stripe (across all of the disks) in order to verify the checksum for that
 data block.  This means that a 128k read (the default zfs blocksize)
 requires a 32kb read from each of 6 disks, which may include a relatively
 slow seek to the relevant part of the spinning rust.  So for random I/O,
 even though the data is striped

 This is not quite true.  Raidz2 is not the same as RAID6.  ZFS has an
 independent checksum for its data blocks.  The traditional RAID type
 technology is used to repair in case data corruption is detected.

What he is saying is true. RAIDZ will spread a block across all disks, and
therefore requires full-stripe reads to read the block. The good thing
is that it will always do full-stripe writes, so writes are fast.

RAID6 has no notion of filesystem blocks, so you can read any sector from
1 disk; you only have to read from the other disks in the stripe in case
of a fault.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs streams data corruption

2009-02-24 Thread Mattias Pantzare
On Tue, Feb 24, 2009 at 19:18, Nicolas Williams
nicolas.willi...@sun.com wrote:
 On Mon, Feb 23, 2009 at 10:05:31AM -0800, Christopher Mera wrote:
 I recently read up on Scott Dickson's blog with his solution for
 jumpstart/flashless cloning of ZFS root filesystem boxes.  I have to say
 that it initially looks to work out cleanly, but of course there are
 kinks to be worked out that deal with auto mounting filesystems mostly.

 The issue that I'm having is that a few days after these cloned systems
 are brought up and reconfigured they are crashing and svc.configd
 refuses to start.

 When you snapshot a ZFS filesystem you get just that -- a snapshot at
 the filesystem level.  That does not mean you get a snapshot at the
 _application_ level.  Now, svc.configd is a daemon that keeps a SQLite2
 database.  If you snapshot the filesystem in the middle of a SQLite2
 transaction you won't get the behavior that you want.

 In other words: quiesce your system before you snapshot its root
 filesystem for the purpose of replicating that root on other systems.

That would be a bug in ZFS or SQLite2.

A snapshot should be an atomic operation. The effect should be the
same as a power failure in the middle of a transaction, and decent
databases can cope with that.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-12 Thread Mattias Pantzare

 Right, well I can't imagine it's impossible to write a small app that can
 test whether or not drives are honoring correctly by issuing a commit and
 immediately reading back to see if it was indeed committed or not.  Like a
 zfs test cXtX.  Of course, then you can't just blame the hardware
 everytime something in zfs breaks ;)

A read of data in the disk cache will be served from the disk cache. You
can't tell the disk to ignore its cache and read directly from the
platter.

The only way to test this is to write and then remove the power from
the disk. Not easy in software.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-10 Thread Mattias Pantzare
 What filesystem likes it when disks are pulled out from a LIVE
 filesystem? Try that on UFS and you're f** up too.

Pulling a disk from a live filesystem is the same as pulling the power
from the computer. All modern filesystems can handle that just fine.
UFS with logging on does not even need fsck.

Now if you have a disk that lies and doesn't write to the disk when it
should, all bets are off.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Max size of log device?

2009-02-08 Thread Mattias Pantzare
On Sun, Feb 8, 2009 at 22:12, Vincent Fox vincent_b_...@yahoo.com wrote:
 Thanks I think I get it now.

 Do you think having log on a 15K RPM drive with the main pool composed of 10K 
 RPM drives will show worthwhile improvements?  Or am I chasing a few 
 percentage points?

 I don't have money for new hardware  SSD.  Just recycling some old 
 components here are and there are a few 15K RPM drives on the shelf I thought 
 I could throw strategically into the mix.

 Application will likely be NFS serving.  Might use same setup for a 
 list-serve system which does have local storage for archived emails etc.

The 3310 has a battery-backed write cache, which is faster than any disk.
You might get more from the cache if you use it only for the log.

The RPM of the disks used for the log is not important when you have a
RAM write cache in front of the disk.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Alternatives to increading the number of copies on a ZFS snapshot

2009-02-07 Thread Mattias Pantzare
On Sat, Feb 7, 2009 at 19:33, Sriram Narayanan sri...@belenix.org wrote:
 How do I set the number of copies on a snapshot ? Based on the error
 message, I believe that I cannot do so.
I already have a number of clones based on this snapshot, and would
 like the snapshot to have more copies now.
For higher redundancy and peace of mind, what alternatives do I have ?

You have to set the number of copies before you write the file.
Snapshots won't write anything, so you can't change that on snapshots.

Your best option (and the only real one if you value your data) is mirroring (zpool attach).
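For example, something like this should turn a single-disk pool into a
mirror (untested sketch, hypothetical pool and device names):

  zpool attach tank c0t0d0 c0t1d0   # add c0t1d0 as a mirror of c0t0d0

The existing data is then resilvered onto the new disk automatically.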
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why does a file on a ZFS change sizes?

2009-02-03 Thread Mattias Pantzare
On Tue, Feb 3, 2009 at 20:55, SQA sqa...@gmail.com wrote:
 I set up a ZFS system on a Linux x86 box.

 [b] zpool history

 History for 'raidpool':
 2009-01-15.17:12:48 zpool create -f raidpool raidz1 c4t1d0 c4t2d0 c4t3d0 
 c4t4d0 c4t5d0
 2009-01-15.17:15:54 zfs create -o mountpoint=/vol01 -o sharenfs=on -o 
 canmount=on raidpool/vol01[/b]

 I did not make the export (vol01) into a volume. I know you can set default 
 blocksizes when you create volumes but you cannot make them exportable NFS 
 exports.  Thus, I did not make the NFS exports into volumes and I did not 
 specify a blocksize on the NFS exports.

 I am assuming that vol01 is using variable blocksizes because I did not 
 explicitly specify a blocksize. Thus, my assumption is that ZFS would use a 
 blocksize that is the the smallest power of 2 and the smallest blocksize is 
 512 bytes while the biggest would be 128k

 I use the stat command to check the filesize, the blocksize, and the # of 
 blocks.

 I created a file that is exactly 512 bytes in size on /vol01

 I do the following stat command:
 [b]stat --printf %n %b %B %s %o\n * [/b]

 The %b is the number of blocks used, %B is the blocksize.

 The number of blocks changes after a few minutes after the file is created:

 # stat --printf %n %b %B %s %o\n *
 file.512 [b]1[/b] 512 512 4096
 # stat --printf %n %b %B %s %o\n *
 file.512 [b]1[/b] 512 512 4096
 # stat --printf %n %b %B %s %o\n *
 file.512 [b]1[/b] 512 512 4096

 Q1) Why does the # of blocks change after a few minutes? And why are we using 
 3 blocks when the file is only 512 bytes in size (in other words, only 1 
 block is needed)???  This makes it seem that the minimum blocksize isn't 512 
 bytes but 1536 bytes.

You probably have a cut'n'paste error, as all block numbers are 1 in
your example.

My guess is that the number of blocks is updated every 5 seconds.


 Q2) Is there a way to force ZFS to use 512 blocksizes?  That means that if a 
 file is 512 bytes in size or smaller, it should only use 512 bytes -- the 
 number of blocks it uses should be 1.

It is, or at least it is on my Solaris system. But it has to store
metadata in one block. Try creating a 600-byte file and it should use
one more 512-byte block.
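For example (untested, using the same GNU stat format as in your test;
file.600 is just an example name):

  dd if=/dev/urandom of=file.600 bs=600 count=1
  sync
  stat --printf "%n %b %B %s %o\n" file.600

I would expect it to report one more 512-byte block than file.512 does.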
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on partitions

2009-01-14 Thread Mattias Pantzare
On Wed, Jan 14, 2009 at 20:03, Tim t...@tcsac.net wrote:


 On Tue, Jan 13, 2009 at 6:26 AM, Brian Wilson bfwil...@doit.wisc.edu
 wrote:

 Does creating ZFS pools on multiple partitions on the same physical drive
 still run into the performance and other issues that putting pools in slices
 does?


 Is zfs going to own the whole drive or not?  The *issue* is that zfs will
 not use the drive cache if it doesn't own the whole disk since it won't know
 whether or not it should be flushing cache at any given point in time.

ZFS will always flush the disk cache at appropriate times. If ZFS
thinks that it is alone it will turn on the disk's write cache.


 It could cause corruption if you had UFS and zfs on the same disk.

It is safe to have UFS and ZFS on the same disk and it has always been safe.

Write cache on the disk is not safe for UFS; that is why zfs will turn
it on only if it is alone.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on partitions

2009-01-14 Thread Mattias Pantzare

 ZFS will always flush the disk cache at appropriate times. If ZFS
 thinks that is alone it will turn the write cache on the disk on.

 I'm not sure if you're trying to argue or agree.  If you're trying to argue,
 you're going to have to do a better job than zfs will always flush disk
 cache at appropriate times, because that's outright false in the case where
 zfs doesn't own the entire disk.  That flush may very well produce an
 outcome zfs could never pre-determine.

You can send flush cache commands to the disk as often as you wish; the
only thing that happens is that the disk writes dirty sectors from its
cache to the platters. That is, no writes will be done that should not
have happened at some time anyway. This will not harm UFS or any other
user of the disk. Other users can issue flush cache commands without
affecting ZFS. Please read up on what the flush cache command does!

ZFS will send flush cache commands even when it is not alone on the
disk. There are many disks with write cache on by default. There have
even been disks that won't turn it off even when told to.


  It could cause corruption if you had UFS and zfs on the same disk.

 It is safe to have UFS and ZFS on the same disk and it has always been
 safe.

 ***unless you turn on write cache.  And without write cache, performance
 sucks.  Hence me answering the OP's question.

There was no mention of cache at all in the question.

It was not clear that this sentence referred to your own text, hence
the misunderstanding:

It could cause corruption if you had UFS and zfs on the same disk.

I read that as a separate statement.


As for "performance sucks", that is putting it a bit harshly; you will
get better performance with the write cache, but the system will be
perfectly usable without it.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mount ZFS pool on different system

2009-01-03 Thread Mattias Pantzare
 Now I want to mount that external zfs hdd on a different notebook running 
 solaris and
 supporting zfs as well.

 I am unable to do so. If I'd run zpool create, it would wipe out my external 
 hdd what I of
 course want to avoid.

 So how can I mount a zfs filesystem on a different machine without destroying 
 it?


zpool import
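For example (mypool is a placeholder for whatever you named the pool):

  zpool import          # with no arguments, lists pools found on attached disks
  zpool import mypool   # imports the pool and mounts its filesystems

If the pool was not exported cleanly on the other notebook you may need
zpool import -f mypool.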
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

2008-12-30 Thread Mattias Pantzare
On Tue, Dec 30, 2008 at 11:30, Carsten Aulbert
carsten.aulb...@aei.mpg.de wrote:
 Hi Marc,

 Marc Bevand wrote:
 Carsten Aulbert carsten.aulbert at aei.mpg.de writes:
 In RAID6 you have redundant parity, thus the controller can find out
 if the parity was correct or not. At least I think that to be true
 for Areca controllers :)

 Are you sure about that ? The latest research I know of [1] says that
 although an algorithm does exist to theoretically recover from
 single-disk corruption in the case of RAID-6, it is *not* possible to
 detect dual-disk corruption with 100% certainty. And blindly running
 the said algorithm in such a case would even introduce corruption on a
 third disk.


 Well, I probably need to wade through the paper (and recall Galois field
 theory) before answering this. We did a few tests in a 16 disk RAID6
 where we wrote data to the RAID, powered the system down, pulled out one
 disk, inserted it into another computer and changed the sector checksum
 of a few sectors (using hdparm's utility makebadsector). The we
 reinserted this into the original box, powered it up and ran a volume
 check and the controller did indeed find the corrupted sector and
 repaired the correct one without destroying data on another disk (as far
 as we know and tested).

You are talking about different types of errors. You tested errors that
the disk can detect. That is not a problem on any RAID; that is what
it is designed for.

He was talking about errors that the disk can't detect (errors
introduced by other parts of the system, writes to the wrong sector or
very bad luck). You can simulate that by writing different data to the
sector.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression

2008-11-29 Thread Mattias Pantzare
 Interestingly, the size fields under top add up to 950GB without getting 
 to the bottom of the list, yet it
 shows NO swap being used, and 150MB free out of 768 of RAM!  So how can the 
 size of the existing processes
 exceed the size of the virtual memory in use by a factor of 2, and the size 
 of total virtual memory by a factor of 1.5?
 This is not the resident size - this is the total size!

Size is how much address space the process has allocated. Part of that
is executables and shared libraries (they are backed by the file, not
by swap). A large portion of that is shared; the same memory is used
by many processes. Processes can also allocate shared memory by other
means.


Memory is not a big problem for ZFS, address space is. You may have to
give the kernel more address space on 32-bit CPUs.

eeprom kernelbase=0x80000000

This will reduce the usable address space of user processes though.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression

2008-11-29 Thread Mattias Pantzare
 If the critical working set of VM pages is larger than available
 memory, then the system will become exceedingly slow.  This is
 indicated by a substantial amount of major page fault activity.
 Since disk is 10,000 times slower than RAM, major page faults can
 really slow things down dramatically.  Imagine what happens if ZFS or
 an often-accessed part of the kernel is not able to fit in available
 RAM.

ZFS and most of the kernel are locked in physical memory. Swap is never
used for ZFS.

In this case (NFS) everything is done in the kernel; the working set
cannot be larger than available memory.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression

2008-11-29 Thread Mattias Pantzare
On Sat, Nov 29, 2008 at 22:19, Ray Clark [EMAIL PROTECTED] wrote:
 Pantzer5:  Thanks for the top  size explanation.

 Re: eeprom kernelbase=0x80000000
 So this makes the kernel load at the 2G mark?  What is the default, something 
 like C00... for 3G?

Yes on both questions (I have not checked the hex conversions).

This might not be your problem, but it is easy to test. My symptom was
that zpool scrub made the computer go slower and slower and finally
just stop. But this was a long time ago so this might not be a problem
today.


 Are PCI and AGP space in there too, such that kernel space is 4G - 
 (kernelbase + PCI_Size + AGP_Size) ?  (Shot in the dark)?

No.

This is virtual memory.

The big difference in memory usage between UFS and ZFS is that ZFS
will have all data it caches mapped in the kernel address space. UFS
leaves data unmapped.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression

2008-11-29 Thread Mattias Pantzare
On Sun, Nov 30, 2008 at 00:04, Bob Friesenhahn
[EMAIL PROTECTED] wrote:
 On Sat, 29 Nov 2008, Mattias Pantzare wrote:

 The big difference in memory usage between UFS and ZFS is that ZFS
 will have all data it caches mapped in the kernel address space. UFS
 leaves data unmapped.

 Another big difference I have heard about is that Solaris 10 on x86 only
 uses something like 64MB of filesystem caching by default for UFS.  This is
 different than SPARC where the caching is allowed to grow.  I am not sure if
 OpenSolaris maintains this arbitrary limit for x86.

That is not true. I doubt that any Solaris version had that type of limit.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression

2008-11-29 Thread Mattias Pantzare
On Sun, Nov 30, 2008 at 01:10, Bob Friesenhahn
[EMAIL PROTECTED] wrote:
 On Sun, 30 Nov 2008, Mattias Pantzare wrote:

 Another big difference I have heard about is that Solaris 10 on x86 only
 uses something like 64MB of filesystem caching by default for UFS.  This
 is
 different than SPARC where the caching is allowed to grow.  I am not sure
 if
 OpenSolaris maintains this arbitrary limit for x86.

 That is not true. I doubt that any Solaris version had that type of limit.

 What is what I heard Jim Mauro tell us.  I recall feeling a bit disturbed
 when I heard it.  If it is true, perhaps it applies only to x86 32 bits,
 which has obvious memory restrictions.  I recall that he showed this
 parameter via DTrace. However on my Solaris 10U5 AMD64 system I see this
 limit:

 429293568   maximum memory allowed in buffer cache (bufhwm)

 which seems much higher than 64MB.  The Solaris Tuning And Tools book says
 that by default the buffer cache is allowed to grow to 2% of physical
 memory.

 Obtain the value via

  sysdef | grep bufhwm

 My 32-bit Belenix system running under VirtualBox with 2GB allocated to the
 VM reports a value of 41,762,816.

That is only a small cache used for file system metadata.
File data caching is integrated into the normal memory management.

http://docs.sun.com/app/docs/doc/817-0404/chapter2-37?a=view
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] continuous replication

2008-11-14 Thread Mattias Pantzare
 I think you're confusing our clustering feature with the remote
 replication feature. With active-active clustering, you have two closely
 linked head nodes serving files from different zpools using JBODs
 connected to both head nodes. When one fails, the other imports the
 failed node's pool and can then serve those files. With remote
 replication, one appliance sends filesystems and volumes across the
 network to an otherwise separate appliance. Neither of these is
 performing synchronous data replication, though.

That is _not_ active-active, that is active-passive.

If you have an active-active system I can access the same data via both
controllers at the same time. I can't if it works like you just
described. You can't call it active-active just because different
volumes are controlled by different controllers. Most active-passive
RAID controllers can do that.

The data sheet talks about active-active clusters, how does that work?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] continuous replication

2008-11-14 Thread Mattias Pantzare
On Sat, Nov 15, 2008 at 00:46, Richard Elling [EMAIL PROTECTED] wrote:
 Adam Leventhal wrote:

 On Fri, Nov 14, 2008 at 10:48:25PM +0100, Mattias Pantzare wrote:


 That is _not_ active-active, that is active-passive.

 If you have a active-active system I can access the same data via both
 controllers at the same time. I can't if it works like you just
 described. You can't call it active-active just because different
 volumes are controlled by different controllers. Most active-passive
 RAID controllers can do that.

 The data sheet talks about active-active clusters, how does that work?


 What the Sun Storage 7000 Series does would more accurately be described
 as
 dual active-passive.


 This is ambiguous in the cluster market.  It is common to describe
 HA clusters where each node can be offering services concurrently,
 as active/active, even though the services themselves are active/passive.
 This is to appease folks who feel that idle secondary servers are a bad
 thing.

But this product is not in the cluster market. It is in the storage market.

By your definition virtually all dual controller RAID boxes are active/active.

You should talk to Veritas so that they can change all their documentation...

Active/active and active/passive have real technical meanings, don't
let marketing destroy that!
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on Fit-PC Slim?

2008-11-06 Thread Mattias Pantzare
 Planning to stick in a 160-gig Samsung drive and use it for lightweight 
 household server.  Probably some Samba usage, and a tiny bit of Apache  
 RADIUS.   I don't need it to be super-fast, but slow as watching paint dry 
 won't

 You know that you need a minimum of 2 disks to form a (mirrored) pool
 with ZFS?  A pool with no redundancy is not a good idea!

My pools with no redundancy are working just fine. Redundancy is better,
but you can certainly run without it. You should do backups in all
cases.


 work either.   Just curious if anyone else has tried something similar 
 everything I  read says ZFS wants 1-gig RAM but don't say what size of 
 penalty I would pay
 for having less.  I could run Linux on it of course but now prefer to remain 
 free of  the tyranny of fsck.

 I  don't think that there is enough CPU horse-power on this platform
 to run OpenSolaris - and you need approx 768Kb (3/4 of a Gb) of RAM
 just to install it.  After that OpenSolaris will only increase in size
 over time   To try to run it as a ZFS server would be madness -
 worse than watching paint dry.

I don't know about the CPU, but 1 GB of RAM on a home server works fine.
I even have a 256 MB debian in virtualbox on my server with 1 GB of RAM.

Just turn X11 off. (/usr/dt/bin/dtconfig -d)

The installation has a higher RAM requirement than the installed
system, as you can't have swap for the installation.

Before ZFS, Solaris had improved its RAM usage with every release.

Workstations are a different matter.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recommendations on adding vdev to raidz zpool

2008-10-26 Thread Mattias Pantzare
On Sun, Oct 26, 2008 at 5:31 AM, Peter Baumgartner [EMAIL PROTECTED] wrote:
 I have a 7x150GB drive (+1 spare) raidz pool that I need to expand.
 There are 6 open drive bays, so I bought 6 300GB drives and went to
 add them as a raidz vdev to the existing zpool, but I didn't realize
 the raidz vdevs needed to have the same number of drives. (why is
 that?)

They do not have to have the same number of drives; you can even mix
raidz and plain disks. That is more a recommendation. Add -f to the
command.
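For example, something like this (untested, hypothetical pool and device
names):

  zpool add -f tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

Without -f, zpool add refuses because the 6-disk raidz does not match the
replication level of your existing 7-disk raidz.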
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recommendations on adding vdev to raidz zpool

2008-10-26 Thread Mattias Pantzare
On Sun, Oct 26, 2008 at 3:00 PM, Peter Baumgartner [EMAIL PROTECTED] wrote:
 On Sun, Oct 26, 2008 at 4:02 AM, Mattias Pantzare [EMAIL PROTECTED] wrote:
 On Sun, Oct 26, 2008 at 5:31 AM, Peter Baumgartner [EMAIL PROTECTED] wrote:
 I have a 7x150GB drive (+1 spare) raidz pool that I need to expand.
 There are 6 open drive bays, so I bought 6 300GB drives and went to
 add them as a raidz vdev to the existing zpool, but I didn't realize
 the raidz vdevs needed to have the same number of drives. (why is
 that?)

 They do not have to have the same number of drivers, you can even mix
 raidz and plain
 disks. That is more a recommendation. Add -f to the command.

 What is the risk of creating a pool consisting of two raidz vdevs that
 don't have the same number of disks?

Slightly different reliability and performance on different parts of
the pool. Nothing to worry about in your case.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver speed.

2008-10-08 Thread Mattias Pantzare
On Wed, Oct 8, 2008 at 10:29 AM, Ross [EMAIL PROTECTED] wrote:
 bounce

 Can anybody confirm how bug 6729696 is going to affect a busy system running 
 synchronous NFS shares?  Is the sync activity from NFS
 going to be enough to prevent resilvering from ever working, or have I 
 mis-understood this bug?

A synchronous write will not trigger a sync. The ZIL is used for
synchronous writes.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I add my own Attributes to a ZAP object, and then search on it?

2008-09-13 Thread Mattias Pantzare
On Sun, Sep 14, 2008 at 12:37 AM, Anon K Adderlan
[EMAIL PROTECTED] wrote:
 How do I add my own Attributes to a ZAP object, and then search on it?

 For example, I want to be able to attach the gamma value to each image, and 
 be able to search and
 sort them based on it.  From reading the on disk format documentation I've 
 been led to believe that this
 would be done through ZAP objects, but what I really need is a reference to 
 the C/C++ or Shell API, and
 whoever put together the Administration Guide has for some reason decided 
 that the code segments should
 be in a white font on a white background.

You can't access ZFS internals from applications.

A database sounds like the right solution, but you could use extended
attributes.

This is from What's New in the Solaris 9 8/03 Operating Environment,
http://docs.sun.com/app/docs/doc/817-0493/6mg9pruau?a=view

Extended File Attributes

The UFS, NFS, and TMPFS file systems have been enhanced to include
extended file attributes. Application developers can associate
specific attributes to a file. For example, a developer of a file
management application for a windowing system might choose to
associate a display icon with a file.

Extended attributes are logically represented as files within a hidden
directory that is associated with the target file.

You can use the extended file attribute API and a set of shell
commands to add and manipulate file system attributes. See the
fsattr(5), openat(2), and runat(1) man pages for more information.

Many file system commands in Solaris provide an attribute-aware option
that you can use to query, copy, modify, or find file attributes. For
more information, see the specific file system command in the man
pages.
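A rough sketch with those tools (untested; image.jpg and the attribute
name gamma are just examples):

  echo 2.2 > /tmp/gamma
  runat image.jpg cp /tmp/gamma gamma   # attach the attribute to the file
  runat image.jpg ls -l                 # list the file's attributes
  runat image.jpg cat gamma             # read it back
  find . -xattr                         # find files that have extended attributes

Note that there is no indexed search on attribute values; sorting by gamma
still means reading the attribute from every file, which is why a database
may be the better fit.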
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-27 Thread Mattias Pantzare
2008/8/27 Richard Elling [EMAIL PROTECTED]:

 Either the drives should be loaded with special firmware that
 returns errors earlier, or the software LVM should read redundant data
 and collect the statistic if the drive is well outside its usual
 response latency.


 ZFS will handle this case as well.


 How is ZFS handling this? Is there a timeout in ZFS?


 Not for this case, but if configured to manage redundancy, ZFS will
 read redundant data from alternate devices.

No, ZFS will not. ZFS waits for the device driver to report an error;
after that it will read from alternate devices.

ZFS could detect that there is probably a problem with the device and
read from an alternate device much faster while it waits for the
device to answer.

You can't do this at any other level than ZFS.



  One thing other LVM's seem like they may do better
 than ZFS, based on not-quite-the-same-scenario tests, is not freeze
 filesystems unrelated to the failing drive during the 30 seconds it's
 waiting for the I/O request to return an error.



 This is not operating in ZFS code.


 In what way is freezing a ZFS filesystem not operating in ZFS code?

 Notice that he wrote filesystems unrelated to the failing drive.



 At the ZFS level, this is dictated by the failmode property.

But that is used after ZFS has detected an error?


 I find comparing unprotected ZFS configurations with LVMs
 using protected configurations to be disingenuous.

I don't think anyone is doing that.



 What is your definition of unrecoverable reads?


 I wrote data, but when I try to read, I don't get back what I wrote.

There is only one case where ZFS is better, and that is when wrong data is
returned. All other cases are managed by layers below ZFS. Wrong data
returned is not normally called unrecoverable reads.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-26 Thread Mattias Pantzare
2008/8/26 Richard Elling [EMAIL PROTECTED]:

 Doing a good job with this error is mostly about not freezing
 the whole filesystem for the 30sec it takes the drive to report the
 error.

 That is not a ZFS problem.  Please file bugs in the appropriate category.

Whose problem is it? It can't be the device driver, as that has no
knowledge of zfs filesystems or redundancy.


 Either the drives should be loaded with special firmware that
 returns errors earlier, or the software LVM should read redundant data
 and collect the statistic if the drive is well outside its usual
 response latency.

 ZFS will handle this case as well.

How is ZFS handling this? Is there a timeout in ZFS?


  One thing other LVM's seem like they may do better
 than ZFS, based on not-quite-the-same-scenario tests, is not freeze
 filesystems unrelated to the failing drive during the 30 seconds it's
 waiting for the I/O request to return an error.


 This is not operating in ZFS code.

In what way is freezing a ZFS filesystem not operating in ZFS code?

Notice that he wrote filesystems unrelated to the failing drive.



 In terms of FUD about ``silent corruption'', there is none of it when
 the drive clearly reports a sector is unreadable.  Yes, traditional
 non-big-storage-vendor RAID5, and all software LVM's I know of except
 ZFS, depend on the drives to report unreadable sectors.  And,
 generally, drives do.  so let's be clear about that and not try to imply
 that the ``dominant failure mode'' causes silent corruption for
 everyone except ZFS and Netapp users---it doesn't.


 In my field data, the dominant failure mode for disks is unrecoverable
 reads.  If your software does not handle this case, then you should be
 worried.  We tend to recommend configuring ZFS to manage data
 redundancy for this reason.

He is writing that all software LVM's will handle unrecoverable reads.

What is your definition of unrecoverable reads?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corrupt zfs stream? checksum mismatch

2008-08-13 Thread Mattias Pantzare
2008/8/13 Jonathan Wheeler [EMAIL PROTECTED]:
 So far we've established that in this case:
 *Version mismatches aren't causing the problem.
 *Receiving across the network isn't the issue (because I have the exact same 
 issue restoring the stream directly on
 my file server).
 *All that's left was the initial send, and since zfs guarantees end to end 
 data integrity, it should have been able to deal
 with any network possible randomness in the middle (zfs on both ends) - or at 
 absolute worst, the zfs send command
 should have failed, if it encountered errors. Seems fair, no?

 So, is there a major bug here, or at least an oversight in the zfs send part 
 of the code?
 Does zfs send not do checksumming, or, verification after sending? I'm not 
 sure how else to interpret this data.

zfs send can't do any verification after sending. It is sending to a
pipe; it does not know that it is writing to a file. zfs receive can
verify the data, as you know.

ZFS is not involved in moving the data over the network when you are using NFS.

There are many places where data can get corrupted even when you are
using ZFS. Non-ECC memory is one example.

There might be a bug in zfs but that is hard to check as you can't
reproduce the problem.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corrupt zfs stream? checksum mismatch

2008-08-12 Thread Mattias Pantzare
2008/8/10 Jonathan Wheeler [EMAIL PROTECTED]:
 Hi Folks,

 I'm in the very unsettling position of fearing that I've lost all of my data 
 via a zfs send/receive operation, despite ZFS's legendary integrity.

 The error that I'm getting on restore is:
 receiving full stream of faith/[EMAIL PROTECTED] into Z/faith/[EMAIL 
 PROTECTED]
 cannot receive: invalid stream (checksum mismatch)

 Background:
 I was running snv_91, and decided to upgrade to snv_95 converting to the much 
 awaited zfs-root in the process.

You could try to restore on a snv_91 system. zfs send streams are not
for backups. This is from the zfs man page:

The format of the stream is evolving. No backwards  com-
patibility is guaranteed. You may not be able to receive
your streams on future versions of ZFS.

Or the file was corrupted when you transferred it.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Block unification in ZFS

2008-08-05 Thread Mattias Pantzare
 Therefore, I wonder if something like block unification (which seems to be
 an old idea, though I know of it primarily through Venti[1]) would be useful
 to ZFS.  Since ZFS checksums all of the data passing through it, it seems
 natural to hook those checksums and have a hash table from checksum to block
 pointer. It would seem that one could write a shim vdev which used the ZAP
 and a host vdev to store this hash table and could inform the higher
 layers that, when writing a block, that they should simply alias an earlier
 block (and increment its reference count -- already there for snapshots --
 appropriately; naturally if the block's reference count becomes zero, its
 checksum should be deleted from the hash).


Deduplication has been discussed many times, but it is not trivial to
implement.

There are no reference counts for blocks. Blocks have a time stamp that
is compared to the creation time of snapshots to work out whether a
block can be freed when you destroy a snapshot.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-28 Thread Mattias Pantzare
 4. While reading an offline disk causes errors, writing does not!
*** CAUSES DATA LOSS ***

 This is a big one:  ZFS can continue writing to an unavailable pool.  It 
 doesn't always generate errors (I've seen it copy over 100MB
 before erroring), and if not spotted, this *will* cause data loss after you 
 reboot.

 I discovered this while testing how ZFS coped with the removal of a hot plug 
 SATA drive.  I knew that the ZFS admin tools were
 hanging, but that redundant pools remained available.  I wanted to see 
 whether it was just the ZFS admin tools that were failing,
 or whether ZFS was also failing to send appropriate error messages back to 
 the OS.


This is not unique to zfs. If you need to know that your writes have
reached stable storage you have to call fsync(). It is not enough to
close a file. This is true even for UFS, but UFS won't delay writes
for all operations, so you will notice it faster. But you will still
lose data.

I have been able to undo rm -rf / on a FreeBSD system by pulling the
power cord before it wrote the changes...

Databases use fsync (or similar) before they close a transaction; that is
one of the reasons that databases like hardware write caches.
cp will not call fsync.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] copying a ZFS

2008-07-20 Thread Mattias Pantzare
2008/7/20 James Mauro [EMAIL PROTECTED]:
 Is there an optimal method of making a complete copy of a ZFS, aside from the 
 conventional methods (tar, cpio)?

 We have an existing ZFS that was not created with the optimal recordsize.
 We wish to create a new ZFS with the optimal recordsize (8k), and copy
 all the data from the existing ZFS to the new ZFS.

 Obviously, we know how to do this using conventional utilities and commands.

 Is there a ZFS-specific method for doing that beats the heck of out tar, etc?
 (RTFM indicates there is not; I R'd the FM :^).

Use zfs send | zfs receive if you wish to keep your snapshots or if
you will be doing the copy several times. You can send just the
changes between two snapshots.

(zfs send is in the FM :-)
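A rough sketch (untested, hypothetical dataset names):

  zfs snapshot tank/data@copy1
  zfs send tank/data@copy1 | zfs receive tank/newdata
  # later, after more changes on the source:
  zfs snapshot tank/data@copy2
  zfs send -i tank/data@copy1 tank/data@copy2 | zfs receive tank/newdata

The incremental receive requires that tank/newdata has not changed since
@copy1 (zfs receive -F can force a rollback first).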


 This may or may not be a copy to the same zpool, and I'd also be interested in
 knowing of that makes a difference (I do not think it does)?

It does not.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Confusion with snapshot send-receive

2008-06-21 Thread Mattias Pantzare
2008/6/21 Andrius [EMAIL PROTECTED]:
 Hi,
 there is a small confusion with send receive.

 zfs andrius/sounds was snapshoted @421 and should be copied to new zpool
 beta that on external USB disk.
 After
 /usr/sbin/zfs send andrius/[EMAIL PROTECTED] | ssh host1 /usr/sbin/zfs recv 
 beta
 or
 usr/sbin/zfs send andrius/[EMAIL PROTECTED] | ssh host1 /usr/sbin/zfs recv
 beta/sounds
 answer come
 ssh: host1: node name or service name not known

 What has been done bad?

There is no computer named host1?
That is an ssh error message; start by checking the ssh part by itself.

If both zpools are on the same computer you don't have to use ssh.
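For example (untested; replace realhost with a name that actually resolves):

  /usr/sbin/zfs send andrius/sounds@421 | /usr/sbin/zfs recv beta/sounds
  /usr/sbin/zfs send andrius/sounds@421 | ssh realhost /usr/sbin/zfs recv beta/sounds

The first form is for when both pools are on the same machine, the second
for when beta really lives on another host.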
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Filesystem for each home dir - 10,000 users?

2008-06-07 Thread Mattias Pantzare

 The problem with that argument is that 10,000 users on one vxfs or UFS
 filesystem is no problem at all, be it /var/mail or home directories.
 You don't even need a fast server for that. 10,000 zfs file systems is
 a problem.

 So, if it makes you happier, substitute mail with home directories.


 If you feel strongly, please pile onto CR 6557894
 http://bugs.opensolaris.org/view_bug.do?bug_id=6557894
 If we continue to talk about it on the alias, we will just end up
 finding ways to solve the business problem using available
 technologies.

If I need to count usage I can use du. But if you can implement space
usage info on a per-uid basis you are not far from quota per uid...


 A single file system serving 10,000 home directories doesn't scale
 either, unless the vast majority are unused -- in which case it is a
 practical problem for much less than 10,000 home directories.
 I think you will find that the people who scale out have a better
 long-term strategy.

We have a file system (vxfs) that is serving 30,000 home directories.
Yes, most of those are unused, but we still have to have them as we
don't know when a student will use it.

If this were zfs we would have to create 30,000 filesystems. Every
file system has a cost in RAM and in performance.

So, in ufs or vxfs unused home directories cost close to nothing. In
zfs they have a very real cost.



 The limitations of UFS do become apparent as you try to scale
 to the size permitted with ZFS.  For example, the largest UFS
 file system supported is 16 TBytes, or 1/4 of a thumper.  So if you
 are telling me that you are serving 10,000 home directories in
 a 16 TByte UFS file system with quotas (1.6 GBytes/user?  I've
 got 16 GBytes in my phone :-), then I will definitely buy you a
 beer.  And aspirin.  I'll bring a calendar so we can measure the
 fsck time when the log can't be replayed.  Actually, you'd
 probably run out of inodes long before you filled it up.  I wonder
 how long it would take to run quotacheck?  But I digress.  Let's
 just agree that UFS won't scale well and the people who do
 serve UFS as home directories for large populations tend to use
 multiple file systems.

We have 30,000 accounts on a 1 TByte file system. If we need to we
could make 16 separate 1 TB file systems, no problem. But 30,000 file
systems on one server? Maybe not so good...

If we could lower the cost of a zfs file system to zero all would be
good for my usage.

The best thing to do is probably AFS on ZFS. AFS can handle many
volumes (file systems) and ZFS is very good at the storage.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Filesystem for each home dir - 10,000 users?

2008-06-06 Thread Mattias Pantzare
2008/6/6 Richard Elling [EMAIL PROTECTED]:
 Richard L. Hamilton wrote:
 A single /var/mail doesn't work well for 10,000 users
 either.  When you
 start getting into that scale of service
 provisioning, you might look at
 how the big boys do it... Apple, Verizon, Google,
 Amazon, etc.  You
 should also look at e-mail systems designed to scale
 to large numbers of
 users
 which implement limits without resorting to file
 system quotas.  Such
 e-mail systems actually tell users that their mailbox
 is too full rather
 than
 just failing to deliver mail.  So please, when we
 start having this
 conversation
 again, lets leave /var/mail out.


 I'm not recommending such a configuration; I quite agree that it is neither
 scalable nor robust.


 I was going to post some history of scaling mail, but I blogged it instead.
 http://blogs.sun.com/relling/entry/on_var_mail_and_quotas
  -- richard


The problem with that argument is that 10,000 users on one vxfs or UFS
filesystem is no problem at all, be it /var/mail or home directories.
You don't even need a fast server for that. 10,000 zfs file systems is
a problem.

So, if it makes you happier, substitute mail with home directories.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Filesystem for each home dir - 10,000 users?

2008-06-05 Thread Mattias Pantzare
 A single /var/mail doesn't work well for 10,000 users either.  When you
 start getting into that scale of service provisioning, you might look at
 how the big boys do it... Apple, Verizon, Google, Amazon, etc.  You

[EMAIL PROTECTED] /var/mail echo *|wc
   1   20632  185597

[EMAIL PROTECTED] /var/mail /usr/platform/sun4u/sbin/prtdiag
System Configuration:  Sun Microsystems  sun4u Sun Enterprise 220R (2 X UltraSPA
RC-II 450MHz)
System clock frequency: 113 MHz
Memory size: 2048 Megabytes

So, 10,000 mail accounts on a new server is not a problem. It depends
on usage patterns, of course.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] help with a BIG problem, can't import my zpool anymore

2008-05-23 Thread Mattias Pantzare
2008/5/24 Hernan Freschi [EMAIL PROTECTED]:
 I let it run while watching TOP, and this is what I got just before it hung. 
 Look at free mem. Is this memory allocated to the kernel? can I allow the 
 kernel to swap?

No, the kernel will not use swap for this.

But most of the memory used by the kernel is probably in caches that
should release memory when needed.

Is this a 32 or 64 bit system?

ZFS will sometimes use all kernel address space on a 32-bit system.

You can give the kernel more address space with this command (only on
32-bit system):
eeprom kernelbase=0x50000000
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz in zfs questions

2008-03-07 Thread Mattias Pantzare
   2. in a raidz do all the disks have to be the same size?


 I think this one has been answered, but I'll add/ask this:  I'm not sure what 
 would
 happen if you had 3x 320gb and 3x 1tb in a 6 disk raidz array.  I know you'd 
 have a
 6 * 320gb array, but I don't know if the unused space on the 3x 1tb could be 
 made
 into another raidz array. If zfs is limited in this way, you could work 
 around it by making
 320gb and 1tb-320gb partitions on the 1tb disks.

You have to use partitions.


  FYI, if you have some data that doesn't really need to be redundant, you can
 simplify your setup with block copies and maybe get closer to raidz 
 efficiency than
 a straight mirror.  Then you could easily replace any disk any time with a 
 bigger one,
 or add a disk any time.  To do this, just make one file system with copies=2 
 and
 store important stuff there.  Store less important stuff in a copies=1 file 
 system.

That is bad advice. Both copies may end up on the same disk. zfs will
try to put your copies on different disks, but it won't tell you if it
can't.

Use mirror or RAIDZ if your data is important!
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz in zfs questions

2008-03-07 Thread Mattias Pantzare
2008/3/7, Paul Kraus [EMAIL PROTECTED]:
 On Thu, Mar 6, 2008 at 8:56 PM, MC [EMAIL PROTECTED] wrote:
1. In zfs can you currently add more disks to an existing raidz? This is 
 important to me
   as i slowly add disks to my system one at a time.
  
No, but solaris and linux raid5 can do this (in linux, grow with mdadm).


 Be aware that growing an SLVM / DiskSuite RAID5 doesn't really
  grow the RAID5 set, it just concats more components onto the end of
  it. If those components are mirrors then you still have redundancy, if
  they aren't then you don't for that data that ends up out there. I
  don't consider growing RIAD5 this was with DiskSuite a good way to go,
  short or long term.


No, that is not how it works. If you grow a RAID5 set the new data is
concatenated to the original RAID5, but it IS protected by the parity
in the RAID5 set. You have redundancy but not the best performance.

From the docs: http://docs.sun.com/app/docs/doc/816-4520/about-raid5-1?a=view

You can expand a RAID-5 volume by concatenating additional components
to the volume. Concatenating a new component to an existing RAID-5
volume decreases the overall performance of the volume because the
data on concatenations is sequential. Data is not striped across all
components. The original components of the volume have data and parity
striped across all components. This striping is lost for the
concatenated component. However, the data is still recoverable from
errors because the parity is used during the component I/O. The
resulting RAID-5 volume continues to handle a single component
failure.

Concatenated components also differ in the sense that they do not have
parity striped on any of the regions. Thus, the entire contents of the
component are available for data.

Any performance enhancements for large or sequential writes are lost
when components are concatenated.
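The SVM command for that is metattach; a sketch with hypothetical names,
where d20 is the existing RAID-5 metadevice:

  metattach d20 c3t0d0s0

The new component is concatenated but still protected by the volume's
parity, as described above.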
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs 32bits

2008-03-06 Thread Mattias Pantzare
2008/3/6, Brian Hechinger [EMAIL PROTECTED]:
 On Thu, Mar 06, 2008 at 11:39:25AM +0100, [EMAIL PROTECTED] wrote:
  
   I think it's specfically problematic on 32 bit systems with large amounts
   of RAM.  Then you run out of virtual address space in the kernel quickly;
   a small amount of RAM (I have one with 512MB) works fine.


 I have a 32-bit machine with 4GB of ram.  I've been researching this for
  some time now, but can't find it anywhere.  At some point, someone posted
  a system config tweak to increase the amount of memory available to the
  ARC on a 32-bit platform.  Who was that, and could you please re-post that
  tweak?

I don't know how to change the ARC size, but use this to increase
kernel address space:

eeprom kernelbase=0x50000000

Your user address space will shrink when you do that.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Regression with ZFS best practice

2008-02-19 Thread Mattias Pantzare
 I've just put my first ZFS into production, and users are complaining about 
 some regressions.

 One problem for them is that now, they can't see all the users directories in 
 the automount point: the homedirs used to be part of a single UFS, and were 
 browsable with the correct autofs option. Now, following the ZFS 
 best-practice, each user has his own FS - but being all shared separately, 
 they're not browsable anymore.

 Is there a way to work around that, and have the same behaviour as before, 
 ie, all homedirs shown in /home, whether they're mounted or not?

Remove -nobrowse from the map in auto_master.
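For example, if your /etc/auto_master has something like

  /home   auto_home   -nobrowse

change it to

  /home   auto_home   -browse

(or simply drop the option), and the automounter will list every entry in
the map under /home whether it is mounted or not.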
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendations for per-user NFS shared home directories?

2008-02-17 Thread Mattias Pantzare
2008/2/17, Bob Friesenhahn [EMAIL PROTECTED]:
 I am attempting to create per-user ZFS filesystems under an exported
 /home ZFS filesystem.  This would work fine except that the
 ownership/permissions settings applied to the mount point of those
 per-user filesystems on the server are not seen by NFS clients.
 Instead NFS clients see directory ownership of root:other (Solaris 9
 clients), root:wheel (OS-X clients), and root:daemon (FreeBSD
 clients).  Only Solaris 10 clients seem to preserve original ownership
 and permissions.


Have the clients mounted your per-user filesystems? It is not enough
to mount /home.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendations for per-user NFS shared home directories?

2008-02-17 Thread Mattias Pantzare
2008/2/17, Bob Friesenhahn [EMAIL PROTECTED]:
 On Sun, 17 Feb 2008, Mattias Pantzare wrote:
 
  Have the clients mounted your per-user filesystems? It is not enough
  to mount /home.

 It is enough to mount /home if the client is Solaris 10.  I did not
 want to mess with creating per-user mounting for all of my different
 type of systems so I punted and all the users are in one filesystem.

 Probably the ZFS documentation which suggests creating per-user home
 directories should be updated so that the existing drawbacks are also
 known.

This is standard NFS behavior; it has nothing to do with ZFS.
Solaris has some new features in this area.

You should use automount for your mounts if you have many clients.
Change the automount map and all clients will mount the new filesystem
if needed. You can move some users to a new server with very little
work, just change the mapping for that user.
You should be able to get all your systems to read the automount maps
from NIS or LDAP.
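For example, an indirect auto_home map served from NIS or LDAP could have
entries like (hypothetical server names and paths):

  bob    server1:/export/home/bob
  eva    server2:/export/home/eva
  *      server1:/export/home/&

Moving a user to another server is then a one-line change in the map.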
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 100% random writes coming out as 50/50 reads/writes

2008-02-15 Thread Mattias Pantzare
  
   If you created them after, then no worries, but if I understand
   correctly, if the *file* was created with 128K recordsize, then it'll
   keep that forever...


 Files have nothing to do with it.  The recordsize is a file system
  parameter.  It gets a little more complicated because the recordsize
  is actually the maximum recordsize, not the minimum.

Please read the manpage:

 Changing the file system's recordsize only affects files
 created afterward; existing files are unaffected.

Nothing is rewritten in the file system when you change recordsize, so
it stays the same for existing files.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDz2 reporting odd (smaller) size

2008-02-13 Thread Mattias Pantzare
2008/2/13, Sam [EMAIL PROTECTED]:
 I saw some other people have a similar problem but reports claimed this was 
 'fixed in release 42' which is many months old, I'm running the latest 
 version.  I made a RAIDz2 of 8x500GB which should give me a 3TB pool:


Disk manufacturers use ISO units, where 1k is 1000. ZFS uses
computer units, where 1k is 1024. So your 500GB is really 465GB.
Check the exact number with format.
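Rough arithmetic, assuming exactly 500 * 10^9 bytes per drive:

  500,000,000,000 / 1024^3 is about 465.7 GB per drive
  raidz2 on 8 drives leaves 6 data drives: 6 * 465.7 is about 2794 GB, i.e. roughly 2.7 TB

so a pool that reports about 2.7T instead of 3T is what you should expect.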
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover

2007-11-10 Thread Mattias Pantzare
2007/11/10, asa [EMAIL PROTECTED]:
 Hello all. I am working on an NFS failover scenario between two
 servers.  I am getting the stale file handle errors on my (linux)
 client which point to there being a mismatch in the fsid's of my two
 filesystems when the failover occurs.
 I understand that the fsid_guid attribute which is then used as the
 fsid in an NFS share, is created at zfs create time, but I would like
 to see and modify that value on any particular zfs filesystem after
 creation.

 More details were discussed at http://www.mail-archive.com/zfs-
 [EMAIL PROTECTED]/msg03662.html but this was talking about the
 same filesystem sitting on a san failing over between two nodes.

 On a linux NFS server one can specify in the nfs exports -o
 fsid=num which can be an arbitrary number, which would seem to fix
 this issue for me, but it seems to be unsupported on Solaris.

As the fsid is created when the file system is created it will be the
same when you mount it on a different NFS server. Why change it?

Or are you trying to match two different file systems? Then you also
have to match all inode numbers on your files. That is not possible at
all.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Error: Volume size exceeds limit for this system

2007-11-09 Thread Mattias Pantzare
2007/11/9, Anton B. Rang [EMAIL PROTECTED]:
 The comment in the header file where this error is defined says:

   /* volume is too large for 32-bit system */

 So it does look like it's a 32-bit CPU issue.  Odd, since file systems don't 
 normally have any sort of dependence on the CPU type

This is not a file system limit, it is a device limit. A zfs volume is
a device, not a file system.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs/zpools iscsi

2007-10-12 Thread Mattias Pantzare
2007/10/12, Krzys [EMAIL PROTECTED]:
 Hello all, sorry if somebody already asked this or not. I was playing today 
 with
 iSCSI and I was able to create zpool and then via iSCSI I can see it on two
 other hosts. I was courious if I could use zfs to have it shared on those two
 hosts but aparently I was unable to do it for obvious reasons. On my linuc
 oracle rac I was using ocfs which works just as I need it, does anyone know if
 such could be acheived with zfs maybe? maybe if not now but in the future? is
 there anything that I could do at this moment to be able to have my two other
 solaris clients see my zpool that I am presenting via iscsi to them both? Is
 there any solutions out there of this kind?

Why not use NFS?
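A sketch (hypothetical pool and host names; the pool itself stays on the
one host that owns it):

  zfs set sharenfs=on mypool/data                # on the host with the pool
  mount -F nfs storagehost:/mypool/data /mnt     # on each client

Both clients then see the same files, and ZFS on the serving host keeps
the on-disk state consistent.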
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Multi-level ZFS in a SAN

2007-09-23 Thread Mattias Pantzare
2007/9/23, James L Baker [EMAIL PROTECTED]:
 I'm a small-time sysadmin with big storage aspirations (I'll be honest
 - for a planned MythTV back-end, and *ahem*, other storage), and I've
 recently discovered ZFS. I'm thinking about putting together a
 homebrew SAN with a NAS head, and am wondering if the following will
 work (hoping the formatting will stick!):


 SAN Box 1:
 8-disk raid-z2 -- iSCSI over GbE --+
   |
 SAN Box 2:   |NAS Head:
 8-disk raid-z2 -- iSCSI over GbE --+-- N-volume zfs pool -- NFS/SMB
   |
 SAN Box N:   |
 8-disk raid-z2 -- iSCSI over GbE --+


 In plain english, for each SAN box, combining 8 (or so) disks in a ZFS
 raid-z2 pool, sharing the pool over GbE via iSCSI, then combining it
 with other (similar) SAN volumes in a non-redundant zfs pool on the
 NAS head, working out the partitioning, quotas, etc there.

It would probably be better to export the raw disks on the SAN boxes
over iSCSI to the NAS head. Let the NAS head do raidz2. That will make
it easier to move disks between computers if you have to. Then you will
have a redundant zfs pool on the NAS head without losing any disk
space.

You could do 3-way raidz so that you can lose any one SAN box.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Please help! ZFS crash burn in SXCE b70!

2007-08-30 Thread Mattias Pantzare
 The problems I'm experiencing are as follows:
 ZFS creates the storage pool just fine, sees no errors on the drives, and 
 seems to work great...right up until I attempt to put data on the drives.  
 After only a few moments of transfer, things start to go wrong.  The system 
 doesn't power off, it just beeps 4-5 times.  The X session dies and the 
 monitor turns off (doesn't drop back to a console).  All network access dies. 
  It seems that the system panics (is it called something else in 
 solaris-land?).  The HD access light stays on (though I can hear no drives 
 doing anything strenuous), and the CD light blinks.  This has happened two or 
 three times, every time I've tried to start copying data to the ZFS pool.   
 I've been transfering over the network, via SCP or NFS.

This could be a hardware problem. A bad power supply for the load? Try
removing 2 of the large disks.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs space efficiency

2007-06-30 Thread Mattias Pantzare

2007/6/25, [EMAIL PROTECTED] [EMAIL PROTECTED]:


I wouldn't de-duplicate without actually verifying that two blocks were
actually bitwise identical.

Absolutely not, indeed.

But the nice property of hashes is that if the hashes don't match then
the inputs do not either.

I.e., the likelyhood of having to do a full bitwise compare is vanishingly
small; the likelyhood of it returning equal is high.


For this application (deduplicating data) the likelihood of matching
hashes is very high. In fact it has to be, otherwise there would not
be any data to deduplicate.

In the cp example, all writes would have matching hashes and all would need a verify.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs kills box, memory related?

2007-06-13 Thread Mattias Pantzare

2007/6/10, arb [EMAIL PROTECTED]:

Hello, I'm new to OpenSolaris and ZFS so my apologies if my questions are naive!

I've got solaris express (b52) and a zfs mirror, but this command locks up my 
box within 5 seconds:
% cmp first_4GB_file second_4GB_file

It's not just these two 4GB files, any serious work in the filesystem (but I 
suspect the larger the file the worse it gets) bring the box to its knees.

I've tried setting the maximum ARC size (seting c,p,c_max with mdb) but it 
doesn't help. Any other suggestions?


If this is on a 32-bit machine, you may be running out of virtual
memory for the kernel. You can try this, and reboot:
eeprom kernelbase=0x50000000

This will limit your userspace processes to about 1Gb.

I would download a new DVD and do an upgrade from that.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss