Re: [zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard

2009-09-08 Thread Chris Csanady
2009/9/7 Ritesh Raj Sarraf r...@researchut.com:
 The Discard/Trim command is also available as part of the SCSI standard now.

 Now, if you look from a SAN perspective, you will need a little of both.
 Filesystems will need to be able to deallocate blocks and then the same 
 should be triggered as a SCSI Trim to the Storage Controller.
 For a virtualized environment, the filesystem should be able to punch holes 
 into virt image files.

 F_FREESP is only on XFS to my knowledge.

I found F_FREESP while looking through the OpenSolaris source, and it
is supported on all filesystems which implement VOP_SPACE.  (I was
initially investigating what it would take to transform writes of
zeroed blocks into block frees on ZFS.  Although it would not appear
to be too difficult, I'm not sure if it would be worth complicating
the code paths.)

 So how does ZFS tackle the above 2 problems ?

At least for file-backed filesystems, ZFS already does its part.  It
is the responsibility of the hypervisor to execute the mentioned
fcntl(), whether it is triggered by a TRIM or anything else.  ZFS does
not issue TRIM itself, though running it on top of files is not
recommended anyway, nor is TRIM needed for virtualization purposes.

It does appear that the ATA TRIM command should be used with great
care though, or avoided altogether.  Not only does it need to wait
for the entire queue to drain, it can also cause a delay of ~100ms if
TRIMs are issued too close together.  (See the thread linked from
the article I mentioned.)

As far as I can tell, Solaris is missing the equivalent of a
DKIOCDISCARD ioctl().  Something like that should be implemented to
allow recovery of space on zvols and iSCSI backing stores. (Though,
the latter would require implementing the SCSI TRIM support as well,
if I understand correctly.)

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard

2009-09-07 Thread Chris Csanady
2009/9/7 Richard Elling richard.ell...@gmail.com:
 On Sep 7, 2009, at 10:20 AM, Bob Friesenhahn wrote:

 The purpose of the TRIM command is to allow the FLASH device to reclaim
 and erase storage at its leisure so that the writer does not need to wait
 for erasure once the device becomes full.  Otherwise the FLASH device does
 not know when an area stops being used.

 Yep, it is there to try and solve the problem of rewrites in a small area,
 smaller than the bulk erase size.  While it would be trivial to traverse the
 spacemap and TRIM the free blocks, it might not improve performance
 for COW file systems. My crystal ball says smarter flash controllers or a
 form of managed flash will win and obviate the need for TRIM entirely.
  -- richard

I agree with this sentiment, although I still look forward to it being obviated
by a better memory technology instead, like PRAM.  In any case, the ATA
TRIM command may not be so useful after all, as it can't be queued:

http://lwn.net/Articles/347511/

As an aside, after a bit of digging, I came across fcntl(F_FREESP).
This will at least allow you to put the sparse back into sparse files if you
so desire.  Unfortunately, I don't see any way to do this for a zvol.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool

2009-09-02 Thread Chris Csanady
2009/9/2 Eric Sproul espr...@omniti.com:

 Adam,
 Is it known approximately when this bug was introduced?  I have a system 
 running
 snv_111 with a large raidz2 pool and I keep running into checksum errors 
 though
 the drives are brand new.  They are 2TB drives, but the pool is only about 14%
 used (~250G/drive across 13 drives).  For a drive to develop hundreds of
 checksum errors at less than 20% capacity seems far above the expected error 
 rate.

This may be 6826470, which was present for some time and fixed in
b114.  If you have replaced a device on b111, you will see a lot of
checksum errors, even after the resilver completes.  In fact, when I
scrubbed my pool it encountered so many that it transitioned the vdev
to a faulted state.  (I had to run zpool clear periodically in a loop
to allow it to finish.)
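Something along these lines kept the scrub moving; the pool and device
names below are placeholders rather than my actual configuration:

# Clear the accumulating checksum errors every minute so the vdev is
# not faulted before the scrub completes.
while true; do
        zpool clear tank c4d0
        sleep 60
done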

See the details at:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6826470

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] odd slog behavior on B70

2007-11-26 Thread Chris Csanady
On Nov 26, 2007 8:41 PM, Joe Little [EMAIL PROTECTED] wrote:
 I was playing with a Gigabyte i-RAM card and found out it works great
 to improve overall performance when there are a lot of writes of small
 files over NFS to such a ZFS pool.

 However, I noted a frequent situation in periods of long writes over
 NFS of small files. Here's a snippet of iostat during that period.
 sd15/sd16 are two iscsi targets, and sd17 is the iRAM card (2GB)

 [iostat output]

 During this time no operations can occur. I've attached the iRAM disk
 via a 3124 card. I've never seen a svc_t time of 0, and full wait and
 busy disk. Any clue what this might mean?

This sounds like 6566207: si3124 driver loses interrupts.  I have
observed similar behavior as a result of this bug.  Upgrading to build
71 or later should fix things.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread Chris Csanady
On Nov 19, 2007 10:08 PM, Richard Elling [EMAIL PROTECTED] wrote:
 James Cone wrote:
  Hello All,
 
  Here's a possibly-silly proposal from a non-expert.
 
  Summarising the problem:
 - there's a conflict between small ZFS record size, for good random
  update performance, and large ZFS record size for good sequential read
  performance
 

 Poor sequential read performance has not been quantified.

I think this is a good point.  A lot of solutions are being thrown
around, and the problems are only theoretical at the moment.
Conventional solutions may not even be appropriate for something like
ZFS.

The point that makes me skeptical is this: blocks do not need to be
logically contiguous to be (nearly) physically contiguous.  As long as
you reallocate the blocks close to the originals, chances are that a
scan of the file will end up being mostly physically contiguous reads
anyway.  ZFS's intelligent prefetching along with the disk's track
cache should allow for good performance even in this case.

ZFS may or may not already do this; I haven't checked.  Obviously, you
won't want to keep a year's worth of snapshots, or run the pool near
capacity.  With a few minor tweaks though, it should work quite well.
Talking about fundamental ZFS design flaws at this point seems
unnecessary to say the least.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on CF/SSDs [was: ZFS - Use h/w raid or not? Thoughts. Considerations.]

2007-06-02 Thread Chris Csanady

On 6/2/07, Richard Elling [EMAIL PROTECTED] wrote:

Chris Csanady wrote:
 On 6/1/07, Frank Cusack [EMAIL PROTECTED] wrote:
 On June 1, 2007 9:44:23 AM -0700 Richard Elling [EMAIL PROTECTED]
 wrote:
 [...]
  Semiconductor memories are accessed in parallel.  Spinning disks are
  accessed
  serially. Let's take a look at a few examples and see what this looks
  like...
 
  Disk                         iops   bw   atime   MTBF       UER      endurance
  ------------------------------------------------------------------------------
  SanDisk 32 GByte 2.5 SATA    7,450  67   0.11    2,000,000  10^-20   ?
  SiliconSystems 8 GByte CF    500    8    2       4,000,000  10^-14   2,000,000
 ...

 these are probably different technologies though?  if cf cards aren't
 generally fast, then the sata device isn't a cf card just with a
 different form factor.  or is the CF interface the limiting factor?

 also, isn't CF write very slow (relative to read)?  if so, you should
 really show read vs write iops.

 Most vendors don't list this, for obvious reasons.  SanDisk is honest
 enough to do so though, and the number is spectacularly bad: 15.

For the SanDisk 32 GByte 2.5 SATA, write bandwidth is 47 MBytes/s -- quite
respectable.


I was quoting the random write IOPS number at 4kB.  The theoretical
sequential write bandwidth is fine, but I don't think that 15 IOPS can
be considered respectable.

They also list the number at 512kB, and it is still only 16 IOPS.
This is probably an artifact of striping across a large number of
flash chips, each of which has a large page size.  It is unknown how
large a transfer is required to actually reach that respectable
sequential write performance, though it probably won't happen often,
if at all.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on CF/SSDs [was: ZFS - Use h/w raid or not? Thoughts. Considerations.]

2007-06-01 Thread Chris Csanady

On 6/1/07, Frank Cusack [EMAIL PROTECTED] wrote:

On June 1, 2007 9:44:23 AM -0700 Richard Elling [EMAIL PROTECTED]
wrote:
[...]
 Semiconductor memories are accessed in parallel.  Spinning disks are
 accessed
 serially. Let's take a look at a few examples and see what this looks
 like...

 Disk                         iops   bw   atime   MTBF       UER      endurance
 ------------------------------------------------------------------------------
 SanDisk 32 GByte 2.5 SATA    7,450  67   0.11    2,000,000  10^-20   ?
 SiliconSystems 8 GByte CF    500    8    2       4,000,000  10^-14   2,000,000
...

these are probably different technologies though?  if cf cards aren't
generally fast, then the sata device isn't a cf card just with a
different form factor.  or is the CF interface the limiting factor?

also, isn't CF write very slow (relative to read)?  if so, you should
really show read vs write iops.


Most vendors don't list this, for obvious reasons.  SanDisk is honest
enough to do so though, and the number is spectacularly bad: 15.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool, RaidZ how it spreads its disk load?

2007-05-07 Thread Chris Csanady

On 5/7/07, Tony Galway [EMAIL PROTECTED] wrote:

Greetings learned ZFS geeks & gurus,

Yet another question comes from my continued ZFS performance testing. This has 
to do with zpool iostat, and the strangeness that I do see.
I've created an eight (8) disk raidz pool from a Sun 3510 fibre array giving me 
a 465G volume.
# zpool create tp raidz c4t600 ... 8 disks worth of zpool
# zfs create tp/pool
# zfs set recordsize=8k tp/pool
# zfs set mountpoint=/pool tp/pool


This is a known problem, and is an interaction between the alignment
requirements imposed by RAID-Z and the small recordsize you have
chosen.  You may effectively avoid it in most situations by choosing a
RAID-Z stripe width of 2^n+1 disks.  For a fixed record size, this will
work perfectly well.
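For example, with five disks instead of eight (the device names below
are placeholders, not taken from your setup):

# 2^2+1 = 5 disks: four data columns plus parity, so an 8k record
# splits into four aligned 2k chunks.
zpool create tp raidz c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0
zfs create tp/pool
zfs set recordsize=8k tp/pool
zfs set mountpoint=/pool tp/pool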

Even so, there will still be cases where small files will cause
problems for RAID-Z.  While it does not affect many people right now,
I think it will become a more serious issue when disks move to 4k
sectors.

I think the reason for the alignment constraint was to ensure that the
stranded space was accounted for, otherwise it would cause problems as
the pool fills up.  (Consider a 3 device RAID-Z, where only one data
sector and one parity sector are written; the third sector in that
stripe is essentially dead space.)

Would it be possible (or worthwhile) to make the allocator aware of
this dead space, rather than imposing the alignment requirements?
Something like a concept of tentatively allocated space in the
allocator, which would be managed based on the requirements of the
vdev.  Using such a mechanism, it could coalesce the space if possible
for allocations.  Of course, it would also have to convert the
misaligned bits back into tentatively allocated space when blocks are
freed.

While I expect this may require changes which would not easily be
backward compatible, the alignment on RAID-Z has always felt a bit
wrong.  While the more severe effects can be addressed by also writing
out the dead space, that will not address uneven placement of data and
parity across the stripes.

Any thoughts?

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAID-Z resilver broken

2007-04-11 Thread Chris Csanady

On 4/11/07, Marco van Lienen [EMAIL PROTECTED] wrote:


A colleague at work and I have followed the same steps, included
running a digest on the /test/file, on a SXCE:61 build today and
can confirm the exact same, and disturbing?, result.  My colleague
mentioned to me he has witnessed the same 'resilver' behavior on
builds 57 and 60.


Thank you for taking the time to confirm this.  Just as long as people
are aware of it, it shouldn't really cause much trouble.  Still, it
gave me quite a scare after replacing a bad disk.


I don't think these checksum errors are a good sign.
The sha1 digest on the file *does* show to be the same so the
question arises: is the resilver process truly broken (even though
in this test-case the test file does appear to be unchanged based on
the sha1 digest) ?


ZFS still has good data, so this is not unexpected.  It is interesting
though that it managed to read all of the data without finding any bad
blocks.  I just tried this with a more complex directory structure,
and other variations, with the same result.  It is bizarre, but in
normal operation ZFS somehow manages to use only the good data.

To see exactly what is damaged though, try the following instead.
After the resilver completes, zpool offline a known good device of the
RAID-Z.  Then, do a scrub or try to read the data.  Afterward, zpool
status -v will display a list of the damaged files, which is very
nice.
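With the two-file test pool from my earlier message, that would look
roughly like this:

# /tmp/test/1 is the known good device; /tmp/test/0 is the replaced one
zpool offline test /tmp/test/1
zpool scrub test

# once the scrub finishes, list the damaged files
zpool status -v test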

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] RAID-Z resilver broken

2007-04-07 Thread Chris Csanady

In a recent message, I detailed the excessive checksum errors that
occurred after replacing a disk.  It seems that after a resilver
completes, it leaves a large number of blocks in the pool which fail
to checksum properly.  Afterward, it is necessary to scrub the pool in
order to correct these errors.

After some testing, it seems that this only occurs with RAID-Z.  The
same behavior can be observed on both snv_59 and snv_60, though I do
not have any other installs to test at the moment.

The following commands should reproduce this result in a small test pool.

Chris


mkdir /tmp/test
mkfile 64m /tmp/test/0 /tmp/test/1
zpool create test raidz /tmp/test/0 /tmp/test/1
mkfile 16m /test/file

zpool export test
rm /tmp/test/0
zpool import -d /tmp/test test
mkfile 64m /tmp/test/0
zpool replace test /tmp/test/0

# wait for the resilver to complete, and observe that it completes successfully
zpool status test

# scrub the pool
zpool scrub test

# watch the checksum errors accumulate as the scrub progresses
zpool status test
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Excessive checksum errors...

2007-04-05 Thread Chris Csanady

I have some further data now, and I don't think that it is a hardware
problem.  Halfway through the scrub, I rebooted and exchanged the
controller and cable used with the bad disk.  After restarting the
scrub, it proceeded error free until about the point where it left
off, and then it resumed the exact same behavior.

Basically, almost exactly one fourth of the amount of data that is
read from the resilvered disk is written to the same disk.  This was
constant throughout the scrub.  Meanwhile, fmd keeps writing
ereport.fs.zfs.io events to the errlog until the disk fills up.
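For anyone who wants to look at those events, fmdump should be able to
dump them; roughly (check fmdump(1M) for the exact options):

# summarize the logged ereports
fmdump -e
# full detail, including the zfs I/O ereports
fmdump -eV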

At this point, it seems as if the resilvering code in snv_60 is
broken, and one fourth of the data was not reconstructed properly.  I
have an iosnoop trace of the disk in question, if anyone is
interested.  I will try to make some sense of it, but that probably
won't happen today.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Excessive checksum errors...

2007-04-04 Thread Chris Csanady

After replacing a bad disk and waiting for the resilver to complete, I
started a scrub of the pool.  Currently, I have the pool mounted
readonly, yet almost a quarter of the I/O is writes to the new disk.
In fact, it looks like there are so many checksum errors, that zpool
doesn't even list them properly:

 pool: p
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
   attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
   using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub in progress, 18.71% done, 2h17m to go
config:

        NAME        STATE     READ WRITE CKSUM
        p           ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0
            c5d0    ONLINE       0     0     0
            c4d0    ONLINE       0     0 231.5

errors: No known data errors

I assume that that should be followed by a K.  Is my brand new
replacement disk really returning gigabyte after gigabyte of silently
corrupted data?  I find that quite amazing, and I thought that I would
inquire here.  This is on snv_60.


Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Firewire/USB enclosures

2007-03-20 Thread Chris Csanady

It looks like the following bug is still open:

   6424510 usb ignores DKIOCFLUSHWRITECACHE

Until it is fixed, I wouldn't even consider using ZFS on USB storage.
Even so, not all bridge boards (Firewire included) implement this
command.  Unless you can verify that it functions correctly, it is
safer to avoid USB and Firewire altogether, as you risk serious
corruption in the event of a power loss.  This holds true for any
filesystem.

Another good reason is that scrubs and rebuilds will take a long time.
Unfortunately, I don't think that port multipliers are yet supported
in the SATA framework, so probably the best bet is a large enclosure
with internal SATA disks.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Chris Csanady

2007/2/12, Frank Hofmann [EMAIL PROTECTED]:

On Mon, 12 Feb 2007, Peter Schuller wrote:

 Hello,

 Often fsync() is used not because one cares that some piece of data is on
 stable storage, but because one wants to ensure the subsequent I/O operations
 are performed after previous I/O operations are on stable storage. In these
 cases the latency introduced by an fsync() is completely unnecessary. An
 fbarrier() or similar would be extremely useful to get the proper semantics
 while still allowing for better performance than what you get with fsync().

 My assumption has been that this has not been traditionally implemented for
 reasons of implementation complexity.

 Given ZFS's copy-on-write transactional model, would it not be almost trivial
 to implement fbarrier()? Basically just choose to wrap up the transaction at
 the point of fbarrier() and that's it.

 Am I missing something?

How do you guarantee that the disk driver and/or the disk firmware doesn't
reorder writes ?

The only guarantee for in-order writes, on actual storage level, is to
complete the outstanding ones before issuing new ones.


This is true for NCQ with SATA, but SCSI also supports ordered tags,
so it should not be necessary.

At least, that is my understanding.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Chris Csanady

2007/2/12, Frank Hofmann [EMAIL PROTECTED]:

On Mon, 12 Feb 2007, Chris Csanady wrote:

 This is true for NCQ with SATA, but SCSI also supports ordered tags,
 so it should not be necessary.

 At least, that is my understanding.

Except that ZFS doesn't talk SCSI, it talks to a target driver. And that
one may or may not treat async I/O requests dispatched via its strategy()
entry point as strictly ordered / non-coalescible / non-cancellable.

See e.g. disksort(9F).


Yes, however, this functionality could be exposed through the target
driver.  While the implementation does not (yet) take full advantage
of ordered tags, Linux does provide an interface to do this:

   http://www.mjmwired.net/kernel/Documentation/block/barrier.txt


From a correctness standpoint, the interface seems worthwhile, even if
the mechanisms are never implemented.  It just feels wrong to execute
a synchronize cache command from ZFS, when often that is not the
intention.  The changes to ZFS itself would be very minor.

That said, actually implementing the underlying mechanisms may not be
worth the trouble.  It is only a matter of time before disks have fast
non-volatile memory like PRAM or MRAM, and then the need to do
explicit cache management basically disappears.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re: [zfs-discuss] Cheap ZFS homeserver.

2007-01-19 Thread Chris Csanady

2007/1/19, [EMAIL PROTECTED] [EMAIL PROTECTED]:

 ACHI SATA ... probably look at Intel boards instead.

What's ACHI?  I didn't see anything useful on Google or Wikipedia ... is it a
chipset?  The issue I take with Intel is their chips are either grossly
power hungry/hot (anything pre-Pentium M) or ungodly expensive (Core, Core
2).  They don't have anything that competes with a 65W AM2 Athlon64.


Oops, I seem to have transposed some characters while typing that.  It
is AHCI: Advanced Host Controller Interface.  Many hardware vendors
are standardizing on this specification for SATA interfaces.  The most
common ones are found in the Intel ICH6R, ICH7R, and ICH8R south
bridges, but others from VIA, Nvidia, SiS and Jmicron are planned or
available.

See http://www.opensolaris.org/os/community/device_drivers/projects/AHCI

The driver is fairly new and does not support much at present, but it
should be a safe bet in the future.  As for PCIe cards, I think the
only options are the two-port SiI3132 and Jmicron based cards.

Sorry about the typo.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re: [zfs-discuss] Re: Re: Re[2]: Re: Dead drives and ZFS

2006-11-14 Thread Chris Csanady

On 11/14/06, Robert Milkowski [EMAIL PROTECTED] wrote:

Hello Rainer,

Tuesday, November 14, 2006, 4:43:32 AM, you wrote:

RH Sorry for the delay...

RH No, it doesn't. The format command shows the drive, but zpool
RH import does not find any pools. I've also used the detached bad
RH SATA drive for testing; no go. Once a drive is detached, there
RH seems to be no (not enough?) information about the pool that allows import.

Aha, you did zpool detach - sorry I missed it. Then zpool import won't
show you any pools to import from such disk. I agree with you it would
be useful to do so.


After examining the source, it is clear that a detach wipes the vdev label.
I suppose it does this so that the machine can't get confused at a later date.
It would be nice if the detach simply renamed something rather than
destroying the pool, though.  At the very least, the manual page ought to
reflect the destructive nature of the detach command.

That said, it looks as if the code only zeros the first uberblock, so the
data may yet be recoverable.  In order to reconstruct the pool, I think
you would need to replace the vdev labels with ones from another of
your mirrors, and possibly the EFI label so that the GUID matched.
Then, corrupt the first uberblock, and pray that it imports.  (It may be
necessary to modify the txg in the labels as well, though I have
already speculated enough...)
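As a first step, it may be worth dumping whatever labels survive on the
detached disk.  Something like this should show them; the device name is
only an example:

# print any vdev labels still present on the detached disk
zdb -l /dev/dsk/c1t1d0s0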

Can anyone say for certain?

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] snv_51 hangs

2006-11-14 Thread Chris Csanady

I have experienced two hangs so far with snv_51.  I was running snv_46
until recently, and it was rock solid, as were earlier builds.

Is there a way for me to force a panic?  It is an x86 machine, with
only a serial console.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_51 hangs

2006-11-14 Thread Chris Csanady

Thank you all for the very quick and informative responses.  If it
happens again, I will try to get a core out of it.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Dead drives and ZFS

2006-11-11 Thread Chris Csanady

On 11/11/06, Rainer Heilke [EMAIL PROTECTED] wrote:

Nope. I get no pools available to import. I think that detaching the drive 
cleared any pool information/headers on the drive, which is why I can't figure out a way 
to get the data/pool back.


Did you also export the original pool before you tried this?  I
believe it was said that you can't import a pool if one of the same
name already exists on the system.  (Of course, you should pull the
other disks as well, or it may not import the right pool.)
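Roughly, with a placeholder pool name:

# export the active pool first, then scan for anything importable
# on the remaining disks
zpool export tank
zpool import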

In any case, I don't think this is expected behavior.  It should be
possible to remove part of a mirror or simply pull a disk without
affecting the contents.  (Assuming that the pool is a single N-way
mirror vdev.)

The manual page for zpool offline indicates that no further attempts
are made to read or write the device, so the data should still be
there.  While it does not elaborate on the result of a zpool detach, I
would expect it to behave the same way, by leaving the data intact.

If it does not work that way, that seems like a serious bug.  Removing
a disk should not destroy a complete replica, whether it is through
zpool detach and attach, or zpool offline and replace.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] Re: Dead drives and ZFS

2006-11-11 Thread Chris Csanady

On 11/11/06, Robert Milkowski [EMAIL PROTECTED] wrote:

CC The manual page for zpool offline indicates that no further attempts
CC are made to read or write the device, so the data should still be
CC there.  While it does not elaborate on the result of a zpool detach, I
CC would expect it to behave the same way, by leaving the data intact.

He did use detach not offline.
Also I'm not sure offline works the way you describe (but I guess it
does). If it does 'zpool import' should show a pool to import however
I'm not sure if there won't be a problem with pool id (not pool name).


Perhaps I have confused the issue of identical pool id and identical
pool names.  Still, I expect there will be issues trying to import an
orphaned part of an existing pool.  This seems like an area which
could use a bit of work.

While a single mirror vdev pool is a corner case, it probably will be
fairly common.  If a disk is intact when removed from the mirror,
through detach, offline, or simply being pulled, it should remain
importable somehow.  (Perhaps it does, after addressing the identical
pool id issue, though I haven't tried.)

In a similar way, it may be nice to allow detach to work on multiple
devices atomically.  For instance, if you have a set of mirror vdevs,
you could then split off an entire replica of the pool, and move it to
another machine.  I think you can do this today by simply exporting
the pool though, so it is not a major inconvenience.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] system hangs on POST after giving zfs a drive

2006-10-12 Thread Chris Csanady

On 10/11/06, John Sonnenschein [EMAIL PROTECTED] wrote:


As it turns out now, something about the drive is causing the machine to hang 
on POST. It boots fine if the drive isn't connected, and if I hot plug the 
drive after the machine boots, it works fine, but the computer simply will not 
boot with the drive attached.

any thoughts on resolution?


Are you using an nforce4 based board?

I have a Tyan K8E, and it hangs on boot if there are EFI labeled disks
present.  (Which is what ZFS uses when you give it whole disks.)  If this
is the problem, configure the BIOS settings so as to not probe those
disks, and then it should boot.

Of course, it won't be possible to boot off those disks, but they should
work fine in Solaris.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: system hangs on POST after giving zfs a drive

2006-10-12 Thread Chris Csanady

On 10/12/06, John Sonnenschein [EMAIL PROTECTED] wrote:

well, it's an SiS 960 board, and it appears my only option to turn off probing 
of the drives is to enable RAID mode (which makes them inaccessible to the OS)


I think the option is in the standard CMOS setup section, and allows you
to set the disk geometry, translation, etc.  There should be options for each
disk, something like: auto detect/manual/not present.  Hopefully your BIOS
has a similar setting.


what would be my next (cheapest) option, a proper SATA add-in card? I've heard 
good things about the silicon image 3132 based cards, but I'm not sure if 
they'll still leave my BIOS in the same position if i run the drives in ATA mode


The best supported card is the Supermicro AOC-SAT2-MV8.  Drivers are
also present for the SiI 3132/3124 based cards in the SATA framework,
but they haven't been updated in a while, and don't support NCQ yet.

Either way, unless you are using a recent Nevada build, any controller
will only run in compatibility mode.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Metaslab alignment on RAID-Z

2006-09-26 Thread Chris Csanady

I believe I have tracked down the problem discussed in the low
disk performance thread.  It seems that an alignment issue will
cause small file/block performance to be abysmal on a RAID-Z.

metaslab_ff_alloc() seems to naturally align all allocations, and
so all blocks will be aligned to asize on a RAID-Z.  At certain
block sizes which do not produce full width writes, contiguous
writes will leave holes of dead space in the RAID-Z.

What I have observed with the iosnoop dtrace script is that the
first disks aggregate the single block writes, while the last disk(s)
are forced to do numerous writes every other sector.  If you would
like to reproduce this, simply copy a large file to a recordsize=4k
filesystem on a 4 disk RAID-Z.
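In shell terms, something like this (the device names are placeholders,
and iosnoop is the script from the DTraceToolkit):

# four-disk RAID-Z with a small recordsize to trigger the misalignment
zpool create rz raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0
zfs set recordsize=4k rz

# copy a large file while watching the per-disk write pattern
cp /path/to/largefile /rz &
iosnoop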

It would probably fix the problem if this dead space was explicitly
zeroed to allow the writes to be aggregated, but that would be
an egregious hack.  If the alignment constraints could be relaxed
though, that should improve the parity distribution, as well as get
rid of the dead space and associated problem.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Metaslab alignment on RAID-Z

2006-09-26 Thread Chris Csanady

On 9/26/06, Richard Elling - PAE [EMAIL PROTECTED] wrote:

Chris Csanady wrote:
 What I have observed with the iosnoop dtrace script is that the
 first disks aggregate the single block writes, while the last disk(s)
 are forced to do numerous writes every other sector.  If you would
 like to reproduce this, simply copy a large file to a recordsize=4k
 filesystem on a 4 disk RAID-Z.

Why would I want to set recordsize=4k if I'm using large files?
For that matter, why would I ever want to use a recordsize=4k, is
there a database which needs 4k record sizes?


Sorry, I wasn't very clear about the reasoning for this.  It is not
something that you would normally do, but it generates just
the right combination of block size and stripe width to make the
problem very apparent.

It is also possible to encounter this on a filesystem with the
default recordsize, and I have observed the effect while extracting
a large archive of sources.  Still, it was never bad enough for my
uses to be anything more than a curiosity.  However, while trying
to rsync 100M ~1k files onto a 4 disk RAID-Z, Gino Ruopolo
seemingly stumbled upon this worst case performance scenerio.
(Though, unlike my example, it is also possible to end up with
holes in the second column.)

Also, while it may be a small error, could these stranded sectors
throw off the space accounting enough to cause problems when
a pool is nearly full?

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: low disk performance

2006-09-23 Thread Chris Csanady

On 9/22/06, Gino Ruopolo [EMAIL PROTECTED] wrote:

Update ...

iostat output during zpool scrub

                    extended device statistics
device    r/s    w/s   Mr/s   Mw/s  wait  actv  svc_t  %w  %b
sd34      2.0  395.2    0.1    0.6   0.0  34.8   87.7   0 100
sd35     21.0  312.2    1.2    2.9   0.0  26.0   78.0   0  79
sd36     20.0    1.0    1.2    0.0   0.0   0.7   31.4   0  13
sd37     20.0    1.0    1.0    0.0   0.0   0.7   35.1   0  21

sd34 is always at 100% ...


What is strange is that this is almost all writes.  Do you have
the rsync running at this time?  A scrub alone should not look
like this.

I have also observed some strange behavior on a 4 disk raidz,
which may be related.  It is possible to saturate a single disk,
while all the others in the same vdev are completely idle.  It
is very easy to reproduce, so try the following:

Create a filesystem with a 4k recordsize on a 4 disk raidz.
Now, copy a large file to it, while observing 'iostat -xnz 5'.
This is the worst case I have been able to produce, but the
imbalance is apparent even with an untar at the default
recordsize.
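Roughly, assuming tank is a four-disk raidz pool (all names here are
placeholders):

zfs create tank/rztest
zfs set recordsize=4k tank/rztest
cp /path/to/largefile /tank/rztest &
iostat -xnz 5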

Interestingly, it is always the last disk in the set which is busy.
This behavior does not occur with a 3 disk raidz, nor is it as
bad with other record sizes.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Bandwidth disparity between NFS and ZFS

2006-06-23 Thread Chris Csanady

While dd'ing to an NFS filesystem, half of the bandwidth is unaccounted
for.  What dd reports amounts to almost exactly half of what zpool iostat
or iostat show, even after accounting for the overhead of the two mirrored
vdevs.  Would anyone care to guess where it may be going?

(This is measured over 10 second intervals.  For 1 second intervals,
the bandwidth to the disks jumps around from 40MB/s to 240MB/s)
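For reference, the measurement amounts to something like this; the mount
point, block size, and count are only examples:

# on the NFS client
dd if=/dev/zero of=/mnt/tank/testfile bs=1024k count=4096

# on the server, over matching intervals
zpool iostat 10
iostat -xnz 10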

With a local dd, everything adds up.  This is with a b41 server, and a
MacOS 10.4 nfs client.  I have verified that the bandwidth at the network
interface is approximately that reported by dd, so the issue would appear
to be within the server.

Any suggestions would be welcome.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive write cache

2006-05-26 Thread Chris Csanady

On 5/26/06, Bart Smaalders [EMAIL PROTECTED] wrote:


There are two failure modes associated with disk write caches:


Failure modes aside, is there any benefit to a write cache when command
queueing is available?  It seems that the primary advantage is in allowing
old ATA hardware to issue writes in an asynchronous manner.  Beyond
that, it doesn't really make much sense, if the queue is deep enough.


ZFS enables the write cache and flushes it when committing transaction
groups; this ensures that all of a transaction group appears or does
not appear on disk.


How often is the write cache flushed, and is it synchronous?  Unless I am
misunderstanding something, wouldn't it be better to use ordered tags, and
avoid cache flushes altogether?

Also, does ZFS disable the disk read cache?  It seems that this would be
counterproductive with ZFS.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss