Re: [zfs-discuss] Zvol vs zfs send/zfs receive

2012-09-15 Thread Bill Sommerfeld
On 09/14/12 22:39, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Dave Pooser

Unfortunately I did not realize that zvols require disk space sufficient
to duplicate the zvol, and my zpool wasn't big enough. After a false start
(zpool add is dangerous when low on sleep) I added a 250GB mirror and a
pair of 3GB mirrors to miniraid and was able to successfully snapshot the
zvol: miniraid/RichRAID@exportable


This doesn't make any sense to me.  The snapshot should not take up any 
(significant) space on the sending side.  It's only on the receiving side, 
trying to receive a snapshot, that you require space, because it won't
clobber the existing zvol on the receiving side until the complete new
zvol has been received.

But simply creating the snapshot on the sending side should be no problem.


By default, zvols have reservations equal to their size (so that writes 
don't fail due to the pool being out of space).


Creating a snapshot in the presence of a reservation requires reserving 
enough space to overwrite every block on the device.


You can remove or shrink the reservation if you know that the entire 
device won't be overwritten.
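
For example, something along these lines (dataset name taken from the
original report; check the numbers on your own system first):

   zfs get volsize,refreservation,usedbyrefreservation miniraid/RichRAID
   zfs set refreservation=none miniraid/RichRAID

Note that refreservation=none makes the zvol thin-provisioned, so later
writes can fail if the pool fills up; setting a smaller value instead is
the middle ground.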




Re: [zfs-discuss] Very poor small-block random write performance

2012-07-20 Thread Bill Sommerfeld

On 07/19/12 18:24, Traffanstead, Mike wrote:

iozone doesn't vary the blocksize during the test, it's a very
artificial test but it's useful for gauging performance under
different scenarios.

So for this test all of the writes would have been 64k blocks, 128k,
etc. for that particular step.

Just as another point of reference I reran the test with a Crucial M4
SSD and the results for 16G/64k were 35 MB/s (a 5x improvement).

I'll rerun that part of the test with zpool iostat and see what it says.


For random writes to work without forcing a lot of read i/o and 
read-modify-write sequences, set the recordsize on the filesystem used 
for the test to match the iozone recordsize.  For instance:


zfs set recordsize=64k $fsname

and ensure that the files used for the test are re-created after you 
make this setting change (recordsize is sticky at file creation time).
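
As a rough sketch of the whole sequence (filesystem name and iozone
flags are only an example):

   zfs set recordsize=64k tank/bench
   rm /tank/bench/iozone.tmp    # an existing file keeps its old recordsize
   iozone -i 0 -i 2 -r 64k -s 16g -f /tank/bench/iozone.tmp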







Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Bill Sommerfeld
On 07/11/12 02:10, Sašo Kiselkov wrote:
 Oh jeez, I can't remember how many times this flame war has been going
 on on this list. Here's the gist: SHA-256 (or any good hash) produces a
 near uniform random distribution of output. Thus, the chances of getting
 a random hash collision are around 2^-256 or around 10^-77.

I think you're correct that most users don't need to worry about this --
sha-256 dedup without verification is not going to cause trouble for them.

But your analysis is off.  You're citing the chance that two blocks picked at
random will have the same hash.  But that's not what dedup does; it compares
the hash of a new block to a possibly-large population of other hashes, and
that gets you into the realm of the birthday problem (a.k.a. the birthday paradox).

See http://en.wikipedia.org/wiki/Birthday_problem for formulas.

So, maybe somewhere between 10^-50 and 10^-55 for there being at least one
collision in really large collections of data - still not likely enough to
worry about.
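
As a rough worked example (my arithmetic, assuming 4K blocks and an
ideal 256-bit hash):

   n = 2^38 blocks (about 1 PB of 4K blocks):
      P(any collision) ~ n^2 / 2^257 = 2^76 / 2^257 = 2^-181, about 10^-55
   n = 2^43 blocks (about 32 PB):
      P(any collision) ~ 2^86 / 2^257 = 2^-171, about 10^-51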

Of course, that assumption goes out the window if you're concerned that an
adversary may develop practical ways to find collisions in sha-256 within the
deployment lifetime of a system.  sha-256 is, more or less, a scaled-up sha-1,
and sha-1 is known to be weaker than the ideal 2^80 strength you'd expect from
2^160 bits of hash; the best credible attack is somewhere around 2^57.5 (see
http://en.wikipedia.org/wiki/SHA-1#SHA-1).

on a somewhat less serious note, perhaps zfs dedup should contain chinese
lottery code (see http://tools.ietf.org/html/rfc3607 for one explanation)
which asks the sysadmin to report a detected sha-256 collision to
eprint.iacr.org or the like...





Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)

2012-05-28 Thread Bill Sommerfeld

On 05/28/12 17:13, Daniel Carosone wrote:

There are two problems using ZFS on drives with 4k sectors:

  1) if the drive lies and presents 512-byte sectors, and you don't
 manually force ashift=12, then the emulation can be slow (and
 possibly error prone). There is essentially an internal RMW cycle
 when a 4k sector is partially updated.  We use ZFS to get away
 from the perils of RMW :)

  2) with ashift=12, whether forced manually or automatically because
 the disks present 4k sectors, ZFS is less space-efficient for
 metadata and keeps fewer historical uberblocks.


two, more specific, problems I've run into recently:

 1) if you move a disk with an ashift=9 pool on it from a 
controller/enclosure/.. combo where it claims to have 512 byte sectors 
to a path where it is detected as having 4k sectors (even if it can cope 
with 512-byte aligned I/O), the pool will fail to import and appear to 
be gravely corrupted; the error message you get will make no mention of 
the sector size change.  Move the disk back to the original location and 
it imports cleanly.


 2) if you have a pool with ashift=9 and a disk dies, and the intended 
replacement is detected as having 4k sectors, it will not be possible to 
attach the disk as a replacement drive..
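
If you want to see what a disk's label claims before moving it between
controllers, zdb will show the recorded ashift (device name here is just
an example):

   zdb -l /dev/rdsk/c5t0d0s0 | grep ashift

ashift: 9 means 512-byte allocation units; ashift: 12 means 4K.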




Re: [zfs-discuss] zfs receive slowness - lots of systime spent in genunix`list_next ?

2011-12-05 Thread Bill Sommerfeld
On 12/05/11 10:47, Lachlan Mulcahy wrote:
zfs`lzjb_decompress            10   0.0%
unix`page_nextn                31   0.0%
genunix`fsflush_do_pages       37   0.0%
zfs`dbuf_free_range           183   0.1%
genunix`list_next            5822   3.7%
unix`mach_cpu_idle         150261  96.1%

your best bet in a situation like this -- where there's a lot of cpu time
spent in a generic routine -- is to use an alternate profiling method that
shows complete stack traces rather than just the top function on the stack.

often the names of functions two or three or four deep in the stack will point
at what's really responsible.

something as simple as:

dtrace -n 'profile-1001 { @[stack()] = count(); }'

(let it run for a bit then interrupt it).

should show who's calling list_next() so much.
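
A slightly fancier variant of the same idea -- restrict it to kernel
samples and keep only the 20 hottest stacks (untested sketch):

dtrace -n 'profile-1001 /arg0/ { @[stack()] = count(); }
    tick-60s { trunc(@, 20); exit(0); }'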

- Bill


Re: [zfs-discuss] zfs diff performance disappointing

2011-09-26 Thread Bill Sommerfeld
On 09/26/11 12:31, Nico Williams wrote:
 On Mon, Sep 26, 2011 at 1:55 PM, Jesus Cea j...@jcea.es wrote:
 Should I disable atime to improve zfs diff performance? (most data
 doesn't change, but atime of most files would change).
 
 atime has nothing to do with it.

based on my experiences with time-based snapshots and atime on a server which
had cron-driven file tree walks running every night, I can easily believe
atime has a lot to do with it - the atime updates associated with a tree walk
will mean that that much of a filesystem's metadata will diverge between the
writeable filesystem and its last snapshot.
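
If you want to test that theory, atime updates are easy to turn off per
filesystem (dataset name is an example); the change takes effect
immediately:

   zfs set atime=off tank/export
   zfs get atime tank/export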

- Bill


Re: [zfs-discuss] Encryption accelerator card recommendations.

2011-06-27 Thread Bill Sommerfeld
On 06/27/11 15:24, David Magda wrote:
 Given the amount of transistors that are available nowadays I think
 it'd be simpler to just create a series of SIMD instructions right
 in/on general CPUs, and skip the whole co-processor angle.

see: http://en.wikipedia.org/wiki/AES_instruction_set

Present in many current Intel CPUs; also expected to be present in AMD's
Bulldozer based CPUs.



Re: [zfs-discuss] OpenIndiana | ZFS | scrub | network | awful slow

2011-06-16 Thread Bill Sommerfeld
On 06/16/11 15:36, Sven C. Merckens wrote:
 But is the L2ARC also important while writing to the device? Because
 the storeges are used most of the time only for writing data on it,
 the Read-Cache (as I thought) isn´t a performance-factor... Please
 correct me, if my thoughts are wrong.

if you're using dedup, you need a large read cache even if you're only
doing application-layer writes, because you need fast random read access
to the dedup tables while you write.
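
For example, adding an ssd as a cache device and checking how big the
dedup table actually is (pool and device names are invented):

   zpool add tank cache c4t1d0
   zdb -DD tank     # prints DDT statistics, including entry counts and sizes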

- Bill




Re: [zfs-discuss] Disk replacement need to scan full pool ?

2011-06-14 Thread Bill Sommerfeld
On 06/14/11 04:15, Rasmus Fauske wrote:
 I want to replace some slow consumer drives with new edc re4 ones but
 when I do a replace it needs to scan the full pool and not only that
 disk set (or just the old drive)
 
 Is this normal ? (the speed is always slow in the start so thats not
 what I am wondering about, but that it needs to scan all of my 18.7T to
 replace one drive)

This is normal.  The resilver is not reading all data blocks; it's
reading all of the metadata blocks which contain one or more block
pointers, which is the only way to find all the allocated data (and in
the case of raidz, know precisely how it's spread and encoded across the
members of the vdev).  And it's reading all the data blocks needed to
reconstruct the disk to be replaced.

- Bill






Re: [zfs-discuss] Wired write performance problem

2011-06-08 Thread Bill Sommerfeld

On 06/08/11 01:05, Tomas Ögren wrote:

And if pool usage is over 90%, then there's another problem (the
algorithm for finding free space changes).


Another (less satisfying) workaround is to increase the amount of free 
space in the pool, either by reducing usage or adding more storage. 
Observed behavior is that allocation is fast until usage crosses a 
threshold, then performance hits a wall.


I have a small sample size (maybe 2-3 samples), but the threshold point
varies from pool to pool and tends to be consistent for a given pool.  I
suspect some artifact of layout/fragmentation is at play.  I've seen 
things hit the wall at as low as 70% on one pool.


The original poster's pool is about 78% full.  If possible, try freeing 
stuff until usage goes back under 75% or 70% and see if your performance 
returns.






Re: [zfs-discuss] Available space confusion

2011-06-06 Thread Bill Sommerfeld

On 06/06/11 08:07, Cyril Plisko wrote:

zpool reports space usage on disks, without taking into account RAIDZ overhead.
zfs reports net capacity available, after RAIDZ overhead accounted for.


Yup.  Going back to the original numbers:

nebol@filez:/$ zfs list tank2
NAME    USED  AVAIL  REFER  MOUNTPOINT
tank2  3.12T   902G  32.9K  /tank2

Given that it's a 4-disk raidz1, you have (roughly) one block of parity 
for every three blocks of data.


3.12T / 3 = 1.04T

so 3.12T + 1.04T = 4.16T, which is close to the 4.18T shown by zpool list:

NAME    SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
tank2  5.44T  4.18T  1.26T   76%  ONLINE  -





Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Bill Sommerfeld
On 05/31/11 09:01, Anonymous wrote:
 Hi. I have a development system on Intel commodity hardware with a 500G ZFS
 root mirror. I have another 500G drive same as the other two. Is there any
 way to use this disk to good advantage in this box? I don't think I need any
 more redundancy, I would like to increase performance if possible. I have
 only one SATA port left so I can only use 3 drives total unless I buy a PCI
 card. Would you please advise me. Many thanks.

I'd use the extra SATA port for an ssd, and use that ssd for some
combination of boot/root, ZIL, and L2ARC.

I have a couple systems in this configuration now and have been quite
happy with the config.  While slicing an ssd and using one slice for
root, one slice for zil, and one slice for l2arc isn't optimal from a
performance standpoint and won't scale up to a larger configuration, it
is a noticeable improvement from a 2-disk mirror.

I used an 80G intel X25-M, with 1G for zil, with the rest split roughly
50:50 between root pool and l2arc for the data pool.
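
In zpool terms, attaching the zil and l2arc slices to the data pool is
just (slice numbers as above, device name invented):

   zpool add data log c2t0d0s3      # ~1G slice for the intent log
   zpool add data cache c2t0d0s4    # remaining space as L2ARC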

- Bill



Re: [zfs-discuss] Format returning bogus controller info

2011-02-26 Thread Bill Sommerfeld

On 02/26/11 17:21, Dave Pooser wrote:

  While trying to add drives one at a time so I can identify them for later
use, I noticed two interesting things: the controller information is
unlike any I've seen before, and out of nine disks added after the boot
drive all nine are attached to c12 -- and no single controller has more
than eight ports.


on your system, c12 is the mpxio virtual controller; any disk which is 
potentially multipath-able (and that includes the SAS drives) will 
appear as a child of the virtual controller (rather than appear as the 
child of two or more different physical controllers).


see stmsboot(1m) for information on how to turn that off if you don't 
need multipathing and don't like the longer device names.


- Bill




Re: [zfs-discuss] ZFS send/recv initial data load

2011-02-16 Thread Bill Sommerfeld

On 02/16/11 07:38, white...@gmail.com wrote:

 Is it possible to use a portable drive to copy the
initial zfs filesystem(s) to the remote location and then make the
subsequent incrementals over the network?


Yes.

 If so, what would I need to do

to make sure it is an exact copy? Thank you,


Rough outline:

plug removable storage into source or a system near the source.
zpool create backup pool on removable storage
use an appropriate combination of zfs send and zfs receive to copy the bits.
zpool export backup pool.
unplug removable storage
move it
plug it in to remote server
zpool import backup pool
use zfs send -i to verify that incrementals work

(I did something like the above when setting up my home backup because I 
initially dinked around with the backup pool hooked up to a laptop and 
then moved it to a desktop system).


optional: use zpool attach to mirror the removable storage to something 
faster/better/..., then after the mirror completes zpool detach to free 
up the removable storage.
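
A command-level sketch of the outline above (pool, snapshot, and device
names are all invented; double-check the flags against your zfs version):

   # at the source, with the portable disk attached
   zpool create backup c5t0d0
   zfs snapshot -r tank/data@base
   zfs send -R tank/data@base | zfs receive -Fdu backup
   zpool export backup
   # move the disk, then at the remote site:
   zpool import backup
   # later incrementals go over the network:
   zfs snapshot -r tank/data@next
   zfs send -R -i @base tank/data@next | ssh remotehost zfs receive -Fdu backup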


- Bill


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Bill Sommerfeld

On 02/07/11 11:49, Yi Zhang wrote:

The reason why I
tried that is to get the side effect of no buffering, which is my
ultimate goal.


ultimate = final.  you must have a goal beyond the elimination of 
buffering in the filesystem.


if the writes are made durable by zfs when you need them to be durable, 
why does it matter that it may buffer data while it is doing so?


- Bill


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Bill Sommerfeld

On 02/07/11 12:49, Yi Zhang wrote:

If buffering is on, the running time of my app doesn't reflect the
actual I/O cost. My goal is to accurately measure the time of I/O.
With buffering on, ZFS would batch up a bunch of writes and change
both the original I/O activity and the time.


if batching main pool writes improves the overall throughput of the 
system over a more naive i/o scheduling model, don't you want your users 
to see the improvement in performance from that batching?


why not set up a steady-state sustained workload that will run for 
hours, and measure how long it takes the system to commit each 1000 or 
10000 transactions in the middle of the steady-state workload?



Re: [zfs-discuss] ZFS advice for laptop

2011-01-04 Thread Bill Sommerfeld

On 01/04/11 18:40, Bob Friesenhahn wrote:

Zfs will disable write caching if it sees that a partition is being used


This is backwards.

ZFS will enable write caching on a disk if a single pool believes it 
owns the whole disk.


Otherwise, it will do nothing to caching.  You can enable it yourself 
with the format command and ZFS won't disable it.
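
From memory, doing it by hand looks roughly like this in format's expert
mode (menu names may differ slightly between releases):

   # format -e
   (select the disk)
   format> cache
   cache> write_cache
   write_cache> enable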


- Bill


Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express

2010-12-02 Thread Bill Sommerfeld

On 11/17/10 12:04, Miles Nordin wrote:

black-box crypto is snake oil at any level, IMNSHO.


Absolutely.


Congrats again on finishing your project, but every other disk
encryption framework I've seen taken remotely seriously has a detailed
paper describing the algorithm, not just a list of features and a
configuration guide.  It should be a requirement for anything treated
as more than a toy.  I might have missed yours, or maybe it's coming
soon.


In particular, the mechanism by which dedup-friendly block IV's are 
chosen based on the plaintext needs public scrutiny.  Knowing Darren, 
it's very likely that he got it right, but in crypto, all the details 
matter and if a spec detailed enough to allow for interoperability isn't 
available, it's safest to assume that some of the details are wrong.


- Bill


Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Bill Sommerfeld

On 09/09/10 20:08, Edward Ned Harvey wrote:

Scores so far:
2 No
1 Yes


No.  resilver does not re-layout your data or change what's in the block
pointers on disk.  if it was fragmented before, it will be fragmented after.



C) Does zfs send | zfs receive mean it will defrag?


Scores so far:
1 No
2 Yes


maybe.  If there is sufficient contiguous freespace in the destination 
pool, files may be less fragmented.


But if you do incremental sends of multiple snapshots, you may well 
replicate some or all the fragmentation on the origin (because snapshots 
only copy the blocks that change, and receiving an incremental send does 
the same).


And if the destination pool is short on space you may end up more 
fragmented than the source.


- Bill



Re: [zfs-discuss] ZFS with Equallogic storage

2010-08-21 Thread Bill Sommerfeld

On 08/21/10 10:14, Ross Walker wrote:

I am trying to figure out the best way to provide both performance and 
resiliency given the Equallogic provides the redundancy.


(I have no specific experience with Equallogic; the following is just 
generic advice)


Every bit stored in zfs is checksummed at the block level; zfs will not 
use data or metadata if the checksum doesn't match.


zfs relies on redundancy (storing multiple copies) to provide 
resilience; if it can't independently read the multiple copies and pick 
the one it likes, it can't recover from bitrot or failure of the 
underlying storage.


if you want resilience, zfs must be responsible for redundancy.

You imply having multiple storage servers.  The simplest thing to do is 
export one large LUN from each of two different storage servers, and 
have ZFS mirror them.


While this reduces the available space, depending on your workload, you 
can make some of it back by enabling compression.
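
Concretely, that looks something like this (the two device names stand
in for the LUNs exported by the two arrays):

   zpool create tank mirror c5t0d0 c6t0d0
   zfs set compression=on tank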


And, given sufficiently recent software, and sufficient memory and/or 
ssd for l2arc, you can enable dedup.


Of course, the effectiveness of both dedup and compression depends on 
your workload.



Would I be better off forgoing resiliency for simplicity, putting all my faith 
into the Equallogic to handle data resiliency?


IMHO, no; the resulting system will be significantly more brittle.


Re: [zfs-discuss] Increase resilver priority

2010-07-23 Thread Bill Sommerfeld

On 07/23/10 02:31, Giovanni Tirloni wrote:

  We've seen some resilvers on idle servers that are taking ages. Is it
possible to speed up resilver operations somehow?

  E.g. iostat shows <5MB/s writes on the replaced disks.


What build of opensolaris are you running?  There were some recent 
improvements (notably the addition of prefetch to the pool traverse used 
by scrub and resilver) which sped this up significantly for my systems.


Also: if there are large numbers of snapshots, pools seem to take longer 
to resilver, particularly when there's a lot of metadata divergence 
between snapshots.  Turning off atime updates (if you and your 
applications can cope with this) may also help going forward.


- Bill




Re: [zfs-discuss] L2ARC and ZIL on same SSD?

2010-07-22 Thread Bill Sommerfeld

On 07/22/10 04:00, Orvar Korvar wrote:

Ok, so the bandwidth will be cut in half, and some people use this
configuration. But, how bad is it to have the bandwidth cut in half?
Will it hardly notice?


For a home server, I doubt you'll notice.

I've set up several systems (desktop & home server) as follows:
- two large conventional disks, mirrored, as data pool.

- single X25-M, 80GB, divided in three slices:
50% in slice 0 as root pool,
(with dedup & compression enabled, and
copies=2 for rpool/ROOT)
1GB in slice 3 as ZIL for data pool
remainder in slice 4 as L2ARC for data pool.

two conventional disks + 1 ssd performs much better than two disks 
alone.  If I needed more space (I haven't, yet), I'd add another mirror 
pair or two to the data pool.


I've been very happy with the results.

- Bill





Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143

2010-07-20 Thread Bill Sommerfeld

On 07/20/10 14:10, Marcelo H Majczak wrote:

It also seems to be issuing a lot more
writing to rpool, though I can't tell what. In my case it causes a
lot of read contention since my rpool is a USB flash device with no
cache. iostat says something like up to 10w/20r per second. Up to 137
the performance has been enough, so far, for my purposes on this
laptop.


if pools are more than about 60-70% full, you may be running into 6962304

workaround: add the following to /etc/system, run
bootadm update-archive, and reboot

-cut here-
* Work around 6962304
set zfs:metaslab_min_alloc_size=0x1000
* Work around 6965294
set zfs:metaslab_smo_bonus_pct=0xc8
-cut here-

no guarantees, but it's helped a few systems..

- Bill




Re: [zfs-discuss] Dedup... still in beta status

2010-06-15 Thread Bill Sommerfeld

On 06/15/10 10:52, Erik Trimble wrote:

Frankly, dedup isn't practical for anything but enterprise-class
machines. It's certainly not practical for desktops or anything remotely
low-end.


We're certainly learning a lot about how zfs dedup behaves in practice. 
 I've enabled dedup on two desktops and a home server and so far 
haven't regretted it on those three systems.


However, they each have more than typical amounts of memory (4G and up) 
a data pool in two or more large-capacity SATA drives, plus an X25-M ssd 
sliced into a root pool as well as l2arc and slog slices for the data 
pool (see below: [1])


I tried enabling dedup on a smaller system (with only 1G memory and a 
single very slow disk), observed serious performance problems, and 
turned it off pretty quickly.


I think, with current bits, it's not a simple matter of "ok for
enterprise, not ok for desktops".  With an ssd for either main storage
or l2arc, and/or enough memory, and/or a not very demanding workload, it 
seems to be ok.


For one such system, I'm seeing:

# zpool list z
NAME   SIZE  ALLOC   FREE   CAP  DEDUP  HEALTH  ALTROOT
z      464G   258G   206G   55%  1.25x  ONLINE  -
# zdb -D z
DDT-sha256-zap-duplicate: 432759 entries, size 304 on disk, 156 in core
DDT-sha256-zap-unique: 1094244 entries, size 298 on disk, 151 in core

dedup = 1.25, compress = 1.44, copies = 1.00, dedup * compress / copies 
= 1.80

- Bill

[1] To forestall responses of the form: you're nuts for putting a slog 
on an x25-m, which is off-topic for this thread and being discussed 
elsewhere:


Yes, I'm aware of the write cache issues on power fail on the x25-m. 
For my purposes, it's a better robustness/performance tradeoff than 
either zil-on-spinning-rust or zil disabled, because:
 a) for many potential failure cases on whitebox hardware running 
bleeding edge opensolaris bits, the x25-m will not lose power and thus 
the write cache will stay intact across a crash.
 b) even if it loses power and loses some writes-in-flight, it's not 
likely to lose *everything* since the last txg sync.


It's good enough for my personal use.  Your mileage will vary.  As 
always, system design involves tradeoffs.



Re: [zfs-discuss] New SSD options

2010-05-20 Thread Bill Sommerfeld

On 05/20/10 12:26, Miles Nordin wrote:

I don't know, though, what to do about these reports of devices that
almost respect cache flushes but seem to lose exactly one transaction.
AFAICT this should be a works/doesntwork situation, not a continuum.


But there's so much brokenness out there.  I've seen similar tail drop 
behavior before -- last write or two before a hardware reset goes into 
the bit bucket, but ones before that are durable.


So, IMHO, a cheap consumer ssd used as a zil may still be worth it (for 
some use cases) to narrow the window of data loss from ~30 seconds to a 
sub-second value.


- Bill


Re: [zfs-discuss] ZFS root ARC memory usage on VxFS system...

2010-05-07 Thread Bill Sommerfeld

On 05/07/10 15:05, Kris Kasner wrote:

Is ZFS swap cached in the ARC? I can't account for data in the ZFS filesystems
to use as much ARC as is in use without the swap files being cached.. seems a
bit redundant?


There's nothing to explicitly disable caching just for swap; from zfs's 
point of view, the swap zvol is just like any other zvol.


But, you can turn this off (assuming sufficiently recent zfs).  try:

zfs set primarycache=metadata rpool/swap

(or whatever your swap zvol is named).

(you probably want metadata rather than none so that things like 
indirect blocks for the swap device get cached).


- Bill





Re: [zfs-discuss] Single-disk pool corrupted after controller failure

2010-05-01 Thread Bill Sommerfeld

On 05/01/10 13:06, Diogo Franco wrote:

After seeing that on some cases labels were corrupted, I tried running
zdb -l on mine:

...
(labels 0, 1 not there, labels 2, 3 are there).


I'm looking for pointers on how to fix this situation, since the disk
still has available metadata.


there are two reasons why you could get this:
 1) the labels are gone.

 2) the labels are not at the start of what solaris sees as p1, and 
thus are somewhere else on the disk.  I'd look more closely at how 
freebsd computes the start of the partition or slice '/dev/ad6s1d'

that contains the pool.

I think #2 is somewhat more likely.

- Bill


Re: [zfs-discuss] Is it safe/possible to idle HD's in a ZFS Vdev to save wear/power?

2010-04-17 Thread Bill Sommerfeld

On 04/16/10 20:26, Joe wrote:

I was just wondering if it is possible to spindown/idle/sleep hard disks that are 
part of a Vdev  pool SAFELY?


it's possible.

my ultra24 desktop has this enabled by default (because it's a known 
desktop type).  see the power.conf man page; I think you may need to add 
an "autopm enable" entry if the system isn't recognized as a known desktop.


the disks spin down when the system is idle; there's a delay of a few 
seconds when they spin back up.


- Bill


Re: [zfs-discuss] SSD best practices

2010-04-17 Thread Bill Sommerfeld

On 04/17/10 07:59, Dave Vrona wrote:

1) Mirroring.  Leaving cost out of it, should ZIL and/or L2ARC SSDs
be mirrored ?


L2ARC cannot be mirrored -- and doesn't need to be.  The contents are
checksummed; if the checksum doesn't match, it's treated as a cache miss
and the block is re-read from the main pool disks.

The ZIL can be mirrored, and mirroring it improves your ability to 
recover the pool in the face of multiple failures.



2) ZIL write cache.  It appears some have disabled the write cache on
the X-25E.  This results in a 5 fold performance hit but it
eliminates a potential mechanism for data loss.  Is this valid?


With the ZIL disabled, you may lose the last ~30s of writes to the pool 
(the transaction group being assembled and written at the time of the 
crash).


With the ZIL on a device with a write cache that ignores cache flush 
requests, you may lose the tail of some of the intent logs, starting 
with the first block in each log  which wasn't readable after the 
restart.  (I say may rather than will because some failures may not 
result in the loss of the write cache).  Depending on how quickly your 
ZIL device pushes writes from cache to stable storage, this may narrow 
the window from ~30s to less than 1s, but doesn't close the window entirely.



If I can mirror ZIL, I imagine this is no longer a concern?


Mirroring a ZIL device with a volatile write cache doesn't eliminate 
this risk.  Whether it reduces the risk depends on precisely *what* 
caused your system to crash and reboot; if the failure also causes loss 
of the write cache contents on both sides of the mirror, mirroring won't 
help.


- Bill


Re: [zfs-discuss] Suggestions about current ZFS setup

2010-04-14 Thread Bill Sommerfeld

On 04/14/10 12:37, Christian Molson wrote:

First I want to thank everyone for their input, It is greatly appreciated.

To answer a few questions:

Chassis I have: 
http://www.supermicro.com/products/chassis/4U/846/SC846E2-R900.cfm

Motherboard:
http://www.tyan.com/product_board_detail.aspx?pid=560

RAM:
24 GB (12 x 2GB)

10 x 1TB Seagates 7200.11
10 x 1TB Hitachi
4   x 2TB WD WD20EARS (4K blocks)


If you have the spare change for it I'd add one or two SSD's to the mix, 
with space on them allocated to the root pool plus l2arc cache, and slog 
for the data pool(s).


- Bill






Re: [zfs-discuss] dedup screwing up snapshot deletion

2010-04-14 Thread Bill Sommerfeld

On 04/14/10 19:51, Richard Jahnel wrote:

This sounds like the known issue about the dedupe map not fitting in ram.


Indeed, but this is not correct:


When blocks are freed, dedupe scans the whole map to ensure each block is not 
in use before releasing it.


That's not correct.

dedup uses a data structure which is indexed by the hash of the contents 
of each block.  That hash function is effectively random, so it needs to 
access a *random* part of the map for each free which means that it (as 
you correctly stated):



... takes a veeery long time if the map doesn't fit in ram.

If you can try adding more ram to the system.


Adding a flash-based ssd as an cache/L2ARC device is also very 
effective; random i/o to ssd is much faster than random i/o to spinning 
rust.


- Bill



Re: [zfs-discuss] Secure delete?

2010-04-11 Thread Bill Sommerfeld

On 04/11/10 10:19, Manoj Joseph wrote:

Earlier writes to the file might have left
older copies of the blocks lying around which could be recovered.


Indeed; to be really sure you need to overwrite all the free space in 
the pool.


If you limit yourself to worrying about data accessible via a regular 
read on the raw device, it's possible to do this without an outage if 
you have a spare disk and a lot of time:


rough process:

 0) delete the files and snapshots containing the data you wish to purge.

 1) replace a previously unreplaced disk in the pool with the spare 
disk using zpool replace


 2) wait for the replace to complete

 3) wipe the removed disk, using the purge command of format(1m)'s 
analyze subsystem or equivalent; the wiped disk is now the spare disk.


 4) if all disks have not been replaced yet, go back to step 1.

This relies on the fact that the resilver kicked off by zpool replace 
copies only allocated data.
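
Per iteration, the commands are roughly (disk names invented):

   zpool replace tank c1t2d0 c1t9d0   # swap in the spare disk
   zpool status tank                  # wait here until the resilver completes
   format -e                          # select the removed disk, then analyze -> purge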


There are some assumptions in the above.  For one, I'm assuming that 
that all disks in the pool are the same size.  A bigger one is that a 
purge is sufficient to wipe the disks completely -- probably the 
biggest single assumption, given that the underlying storage devices 
themselves are increasingly using copy-on-write techniques.


The most paranoid will replace all the disks and then physically destroy 
the old ones.


- Bill





Re: [zfs-discuss] Secure delete?

2010-04-11 Thread Bill Sommerfeld

On 04/11/10 12:46, Volker A. Brandt wrote:

The most paranoid will replace all the disks and then physically
destroy the old ones.


I thought the most paranoid will encrypt everything and then forget
the key... :-)


Actually, I hear that the most paranoid encrypt everything *and then*
destroy the physical media when they're done with it.


Seriously, once encrypted zfs is integrated that's a viable method.


It's certainly a new tool to help with the problem, but consider that
forgetting a key requires secure deletion of the key.

Like most cryptographic techniques, filesystem encryption only changes
the size of the problem we need to solve.

- Bill


Re: [zfs-discuss] SSD sale on newegg

2010-04-06 Thread Bill Sommerfeld

On 04/06/10 17:17, Richard Elling wrote:

You could probably live with an X25-M as something to use for all three,
but of course you're making tradeoffs all over the place.


That would be better than almost any HDD on the planet because
the HDD tradeoffs result in much worse performance.


Indeed.  I've set up a couple small systems (one a desktop workstation, 
and the other a home fileserver) with root pool plus the l2arc and slog 
for a data pool on an 80G X25-M and have been very happy with the result.


The recipe I'm using is to slice the ssd, with the rpool in s0 with 
roughly half the space, 1GB in s3 for slog, and the rest of the space as 
L2ARC in s4.  That may actually be overly generous for the root pool, 
but I run with copies=2 on rpool/ROOT and I tend to keep a bunch of BE's 
around.


- Bill




Re: [zfs-discuss] Tuning the ARC towards LRU

2010-04-05 Thread Bill Sommerfeld

On 04/05/10 15:24, Peter Schuller wrote:

In the urxvt case, I am basing my claim on informal observations.
I.e., hit terminal launch key, wait for disks to rattle, get my
terminal. Repeat. Only by repeating it very many times in very rapid
succession am I able to coerce it to be cached such that I can
immediately get my terminal. And what I mean by that is that it keeps
necessitating disk I/O for a long time, even on rapid successive
invocations. But once I have repeated it enough times it seems to
finally enter the cache.


Are you sure you're not seeing unrelated disk update activity like atime 
updates, mtime updates on pseudo-terminals, etc., ?


I'd want to start looking more closely at I/O traces (dtrace can be very 
helpful here) before blaming any specific system component for the 
unexpected I/O.


- Bill


Re: [zfs-discuss] Proposition of a new zpool property.

2010-03-22 Thread Bill Sommerfeld
On 03/22/10 11:02, Richard Elling wrote:
 Scrub tends to be a random workload dominated by IOPS, not bandwidth.

you may want to look at this again post build 128; the addition of
metadata prefetch to scrub/resilver in that build appears to have
dramatically changed how it performs (largely for the better).

- Bill


Re: [zfs-discuss] sympathetic (or just multiple) drive failures

2010-03-20 Thread Bill Sommerfeld

On 03/19/10 19:07, zfs ml wrote:

What are peoples' experiences with multiple drive failures?


1985-1986.  DEC RA81 disks.  Bad glue that degraded at the disk's 
operating temperature.  Head crashes.  No more need be said.


- Bill





Re: [zfs-discuss] Scrub not completing?

2010-03-17 Thread Bill Sommerfeld

On 03/17/10 14:03, Ian Collins wrote:

I ran a scrub on a Solaris 10 update 8 system yesterday and it is 100%
done, but not complete:

   scrub: scrub in progress for 23h57m, 100.00% done, 0h0m to go


Don't panic.  If zpool iostat still shows active reads from all disks 
in the pool, just step back and let it do its thing until it says the 
scrub is complete.


There's a bug open on this:

6899970 scrub/resilver percent complete reporting in zpool status can be 
overly optimistic


scrub/resilver progress reporting compares the number of blocks read so 
far to the number of blocks currently allocated in the pool.


If blocks that have already been visited are freed and new blocks are 
allocated, the seen:allocated ratio is no longer an accurate estimate of 
how much more work is needed to complete the scrub.


Before the scrub prefetch code went in, I would routinely see scrubs 
last 75 hours which had claimed to be 100.00% done for over a day.


- Bill






Re: [zfs-discuss] Snapshot recycle freezes system activity

2010-03-08 Thread Bill Sommerfeld

On 03/08/10 12:43, Tomas Ögren wrote:
So we tried adding 2x 4GB USB sticks (Kingston Data Traveller Mini Slim)
as metadata L2ARC and that seems to have pushed the snapshot times down
to about 30 seconds.


Out of curiosity, how much physical memory does this system have?



Re: [zfs-discuss] terrible ZFS performance compared to UFS on ramdisk (70% drop)

2010-03-08 Thread Bill Sommerfeld

On 03/08/10 17:57, Matt Cowger wrote:

Change zfs options to turn off checksumming (don't want it or need it), atime, 
compression, 4K block size (this is the applications native blocksize) etc.


even when you disable checksums and compression through the zfs command, 
zfs will still compress and checksum metadata.


the evil tuning guide describes an unstable interface to turn off 
metadata compression, but I don't see anything in there for metadata 
checksums.


if you have an actual need for an in-memory filesystem, will tmpfs fit 
the bill?
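
For reference, mounting an extra tmpfs is a one-liner (mount point and
size cap are whatever suits the benchmark):

   mkdir -p /bench
   mount -F tmpfs -o size=4096m swap /bench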


- Bill


Re: [zfs-discuss] swap across multiple pools

2010-03-03 Thread Bill Sommerfeld

On 03/03/10 05:19, Matt Keenan wrote:

In a multipool environment, would it make sense to add swap to a pool outside of
the root pool, either as the sole swap dataset to be used or as extra swap?


Yes.  I do it routinely, primarily to preserve space on boot disks on 
large-memory systems.


swap can go in any pool, while dump has the same limitations as root: 
single top-level vdev, single-disk or mirrors only.



Would this have any performance implications ?


If the non-root pool has many spindles, random read I/O should be faster 
and thus swap i/o should be faster.  I haven't attempted to measure if 
this makes a difference.


I generally set primarycache=metadata on swap zvols but I also haven't 
been able to measure whether it makes any difference.


My users do complain when /tmp fills because there isn't sufficient swap 
so I do know I need large amounts of swap on these systems.  (when 
migrating one such system from Nevada to Opensolaris recently I forgot 
to add swap to /etc/vfstab).
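
The mechanics, for the record (pool name and size are examples):

   zfs create -V 16g datapool/swap
   swap -a /dev/zvol/dsk/datapool/swap

and in /etc/vfstab, so it comes back after a reboot:

   /dev/zvol/dsk/datapool/swap  -  -  swap  -  no  -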


- Bill






Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Bill Sommerfeld

On 03/02/10 08:13, Fredrich Maney wrote:

Why not do the same sort of thing and use that extra bit to flag a
file, or directory, as being an ACL only file and will negate the rest
of the mask? That accomplishes what Paul is looking for, without
breaking the existing model for those that need/wish to continue to
use it?


While we're designing on the fly: Another possibility would be to use an 
additional umask bit or two to influence the mode-bit - acl interaction.


- Bill


Re: [zfs-discuss] compressed root pool at installation time with flash archive predeployment script

2010-03-02 Thread Bill Sommerfeld

On 03/02/10 12:57, Miles Nordin wrote:

cc == chad campbell <chad.campb...@cummins.com>  writes:


 cc  I was trying to think of a way to set compression=on
 cc  at the beginning of a jumpstart.

are you sure grub/ofwboot/whatever can read compressed files?


Grub and the sparc zfs boot blocks can read lzjb-compressed blocks in zfs.

I have compression=on (and copies=2) for both sparc and x86 roots; I'm 
told that grub's zfs support also knows how to fall back to ditto blocks 
if the first copy fails to be readable or has a bad checksum.


- Bill






Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-01 Thread Bill Sommerfeld

On 03/01/10 13:50, Miles Nordin wrote:

dd == David Dyer-Bennet <d...@dd-b.net>  writes:


 dd  Okay, but the argument goes the other way just as well -- when
 dd  I run chmod 6400 foobar, I want the permissions set that
 dd  specific way, and I don't want some magic background feature
 dd  blocking me.

This will be true either way.  Even if chmod isn't ignored, it will
reach into the nest of ACL's and mangle them in some non-obvious way
with unpredictable consequences, and the mangling will be implemented
by a magical background feature.


actually, you can be surprised even if there are no acls in use -- if, 
unbeknownst to you, some user has been granted file_dac_read or 
file_dac_write privilege, they will be able to bypass the file modes for 
read and/or for write.


Likewise if that user has been delegated zfs send rights on the 
filesystem the file is in, they'll be able to read every bit of the file.


- Bill



Re: [zfs-discuss] ZFS compression and deduplication on root pool on SSD

2010-02-28 Thread Bill Sommerfeld

On 02/28/10 15:58, valrh...@gmail.com wrote:

Also, I don't have the numbers to prove this, but it seems to me
that the actual size of rpool/ROOT has grown substantially since I
did a clean install of build 129a (I'm now at build 133).  Without
compression, either, that was around 24 GB, but things seem
to have accumulated by an extra 11 GB or so.

One common source for this is slowly accumulating files under
/var/pkg/download.

Clean out /var/pkg/download and delete all but the most recent boot 
environment to recover space (you need to do this to get the space back 
because the blocks are referenced by the snapshots used by each clone as 
its base version).


To avoid this in the future, set PKG_CACHEDIR in your environment to 
point at a filesystem which isn't cloned by beadm -- something outside 
rpool/ROOT, for instance.


On several systems which have two pools (root & data) I've relocated it
to the data pool - it doesn't have to be part of the root pool.  This 
has significantly slimmed down my root filesystem on systems which are 
chasing the dev branch of opensolaris.
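
Something like this is all it takes (pool and path are examples; set the
variable wherever pkg(1) picks up its environment):

   zfs create -o mountpoint=/export/pkgcache data/pkgcache
   export PKG_CACHEDIR=/export/pkgcache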


 At present, my rpool/ROOT has no compression, and no deduplication. I
 was wondering about whether it would be a good idea, from a
 performance and data integrity standpoint, to use one, the other, or
 both, on the root pool.

I've used the combination of copies=2 and compression=yes on rpool/ROOT 
for a while and have been happy with the result.


On one system I recently moved to an ssd root, I also turned on dedup 
and it seems to be doing just fine:


NAME   SIZE  ALLOC   FREE   CAP  DEDUP  HEALTH  ALTROOT
r2      37G  14.7G  22.3G   39%  1.31x  ONLINE  -

(the relatively high dedup ratio is because I have one live upgrade BE 
with nevada build 130, and a beadm BE with opensolaris build 130, which 
is mostly the same)


- Bill





Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Bill Sommerfeld

On 02/26/10 10:45, Paul B. Henson wrote:

I've already posited as to an approach that I think would make a pure-ACL
deployment possible:


http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037206.html

Via this concept or something else, there needs to be a way to configure
ZFS to prevent the attempted manipulation of legacy permission mode bits
from breaking the security policy of the ACL.


I believe this proposal is sound.

In it, you wrote:


The feedback was that the internal Sun POSIX compliance police
wouldn't like that ;).


There are already per-filesystem tunables for ZFS which allow the
system to escape the confines of POSIX (noatime, for one); I don't see
why a "chmod doesn't truncate acls" option couldn't join it so long as
it was off by default and left off while conformance tests were run.

- Bill



Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2010-02-26 Thread Bill Sommerfeld

On 02/26/10 11:42, Lutz Schumann wrote:

Idea:
   - If the guest writes a block with 0's only, the block is freed again
   - If someone reads this block again, it will get the same 0's it would get
     if the 0's had been written
   - The checksum of an all-0 block can be hard coded for SHA1 / Fletcher, so
     the comparison for "is this a 0-only block" is easy

With this in place, a host wishing to free thin-provisioned zvol space can fill
the unused blocks with 0s easily with simple tools (e.g. dd if=/dev/zero
of=/MYFILE bs=1M; rm /MYFILE) and the space is freed again on the zvol side.


You've just described how ZFS behaves when compression is enabled -- a 
block of zeros is compressed to a hole represented by an all-zeros block 
pointer.


 Does anyone know why this is not incorporated into ZFS ?

It's in there.  Turn on compression to use it.
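
i.e. something as simple as (zvol name is an example):

   zfs set compression=on tank/myvol
   zfs get compressratio tank/myvol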


- Bill






Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Bill Sommerfeld

On 02/26/10 17:38, Paul B. Henson wrote:

As I wrote in that new sub-thread, I see no option that isn't surprising
in some way.  My preference would be for what I labeled as option (b).


And I think you absolutely should be able to configure your fileserver to
implement your preference. Why shouldn't I be able to configure my
fileserver to implement mine :)?


acl-chmod interactions have been mishandled so badly in the past that i 
think a bit of experimentation with differing policies is in order.


Based on the amount of wailing I see around acls, and on personal
experience with both systems, I think AFS had it more or less right and
POSIX got it more or less wrong -- once you step into the world of acls,
the file mode should be mostly ignored, and an accidental chmod should
*not* destroy carefully crafted acls.


- Bill


Re: [zfs-discuss] ZFS ZIL + L2ARC SSD Setup

2010-02-12 Thread Bill Sommerfeld

On 02/12/10 09:36, Felix Buenemann wrote:

given I've got ~300GB L2ARC, I'd
need about 7.2GB RAM, so upgrading to 8GB would be enough to satisfy the
L2ARC.


But that would only leave ~800MB free for everything else the server 
needs to do.


- Bill


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-11 Thread Bill Sommerfeld

On 02/11/10 10:33, Lori Alt wrote:

This bug is closed as a dup of another bug which is not readable from
the opensolaris site, (I'm not clear what makes some bugs readable and
some not).


the other bug in question was opened yesterday and probably hasn't had 
time to propagate.


- Bill




Re: [zfs-discuss] most of my space is gone

2010-02-06 Thread Bill Sommerfeld

On 02/06/10 08:38, Frank Middleton wrote:

AFAIK there is no way to get around this. You can set a flag so that pkg
tries to empty /var/pkg/downloads, but even though it looks empty, it
won't actually become empty until you delete the snapshots, and IIRC
you still have to manually delete the contents. I understand that you
can try creating a separate dataset and mounting it on /var/pkg, but I
haven't tried it yet, and I have no idea if doing so gets around the
BE snapshot problem.


You can set the environment variable PKG_CACHEDIR to place the cache in 
an alternate filesystem.




Re: [zfs-discuss] server hang with compression on, ping timeouts from remote machine

2010-01-31 Thread Bill Sommerfeld

On 01/31/10 07:07, Christo Kutrovsky wrote:

I've also experienced similar behavior (short freezes) when running
zfs send|zfs receive with compression on LOCALLY on ZVOLs again.

Has anyone else experienced this? Know of any bug? This is on
snv117.


you might also get better results after the fix to:

6881015 ZFS write activity prevents other threads from running in a 
timely manner


which was fixed in build 129.

As a workaround, try a lower gzip compression level -- higher gzip
levels usually burn lots more CPU without significantly increasing the
compression ratio.
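
e.g. (filesystem name is an example):

   zfs set compression=gzip-2 tank/data    # instead of gzip-9
   zfs get compressratio tank/data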

- Bill



Re: [zfs-discuss] zvol being charged for double space

2010-01-27 Thread Bill Sommerfeld

On 01/27/10 21:17, Daniel Carosone wrote:

This is as expected.  Not expected is that:

  usedbyrefreservation = refreservation

I would expect this to be 0, since all the reserved space has been
allocated.


This would be the case if the volume had no snapshots.


As a result, used is over twice the size of the volume (+
a few small snapshots as well).


I'm seeing essentially the same thing with a recently-created zvol
with snapshots that I export via iscsi for time machine backups on a
mac.

% zfs list -r -o name,refer,used,usedbyrefreservation,refreservation,volsize z/tm/mcgarrett

NAME            REFER  USED   USEDREFRESERV  REFRESERV  VOLSIZE
z/tm/mcgarrett  26.7G  88.2G  60G            60G        60G

The actual volume footprint is a bit less than half of the volume
size, but the refreservation ensures that there is enough free space
in the pool to allow me to overwrite every block of the zvol with
uncompressable data without any writes failing due to the pool being
out of space.

If you were to disable time-based snapshots and then overwrite a measurable
fraction of the zvol you I'd expect USEDBYREFRESERVATION to shrink as
the reserved blocks were actually used.

If you want to allow for overcommit, you need to delete the refreservation.

- Bill



Re: [zfs-discuss] Disks and caches

2010-01-07 Thread Bill Sommerfeld
On Thu, 2010-01-07 at 11:07 -0800, Anil wrote:
 There is talk about using those cheap disks for rpool. Isn't rpool
 also prone to a lot of writes, specifically when the /tmp is in a SSD?

Huh?  By default, solaris uses tmpfs for /tmp, /var/run,
and /etc/svc/volatile; writes to those filesystems won't hit the SSD
unless the system is short on physical memory.

- Bill



Re: [zfs-discuss] zpool fragmentation issues?

2009-12-15 Thread Bill Sommerfeld
On Tue, 2009-12-15 at 17:28 -0800, Bill Sprouse wrote:
 After running for a while (couple of months) the zpool seems to get
 fragmented, backups take 72 hours and a scrub takes about 180 hours.

Are there periodic snapshots being created in this pool?  

Can they run with atime turned off?

(file tree walks performed by backups will update the atime of all
directories; this will generate extra write traffic and also cause
snapshots to diverge from their parents and take longer to scrub).

- Bill



Re: [zfs-discuss] zfs on ssd

2009-12-11 Thread Bill Sommerfeld
On Fri, 2009-12-11 at 13:49 -0500, Miles Nordin wrote:
  sh == Seth Heeren s...@zfs-fuse.net writes:
 
 sh If you don't want/need log or cache, disable these? You might
 sh want to run your ZIL (slog) on ramdisk.
 
 seems quite silly.  why would you do that instead of just disabling
 the ZIL?  I guess it would give you a way to disable it pool-wide
 instead of system-wide.
 
 A per-filesystem ZIL knob would be awesome.

for what it's worth, there's already a per-filesystem ZIL knob: the
logbias property.  It can be set either to "latency" or
"throughput".
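
For example (filesystem name is an example):

   zfs set logbias=throughput tank/db
   zfs get logbias tank/db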



Re: [zfs-discuss] Resilver/scrub times?

2009-11-22 Thread Bill Sommerfeld
Yesterday's integration of 

6678033 resilver code should prefetch

as part of changeset 74e8c05021f1 (which should be in build 129 when it
comes out) may improve scrub times, particularly if you have a large
number of small files and a large number of snapshots.  I recently
tested an early version of the fix, and saw one pool go from an elapsed
time of 85 hours to 20 hours; another (with many fewer snapshots) went
from 35 to 17.  

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs eradication

2009-11-11 Thread Bill Sommerfeld
On Wed, 2009-11-11 at 10:29 -0800, Darren J Moffat wrote:
 Joerg Moellenkamp wrote:
  Hi,
  
  Well ... i think Darren should implement this as a part of
 zfs-crypto. Secure Delete on SSD looks like quite challenge, when wear
 leveling and bad block relocation kicks in ;)
 
 No I won't be doing that as part of the zfs-crypto project. As I said 
 some jurisdictions are happy that if the data is encrypted then 
 overwrite of the blocks isn't required.   For those that aren't use 
 dd(1M) or format(1M) may be sufficient - if that isn't then nothing 
 short of physical destruction is likely good enough.

note that eradication via overwrite makes no sense if the underlying
storage uses copy-on-write, because there's no guarantee that the newly
written block actually will overlay the freed block.

IMHO the sweet spot here may be to overwrite once with zeros (allowing
the block to be compressed out of existence if the underlying storage is
a compressed zvol or equivalent) or to use the TRIM command.

(It may also be worthwhile for zvols exported via various protocols to
themselves implement the TRIM command -- freeing the underlying
storage).
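A rough sketch of the overwrite-with-zeros approach for a zvol (path is
illustrative, and this of course destroys the contents):

dd if=/dev/zero of=/dev/zvol/rdsk/pool/somevol bs=1024k

With compression enabled on the backing store, the all-zero blocks are
stored as holes rather than allocated blocks.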

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] This is the scrub that never ends...

2009-11-10 Thread Bill Sommerfeld
On Fri, 2009-09-11 at 13:51 -0400, Will Murnane wrote:
 On Thu, Sep 10, 2009 at 13:06, Will Murnane will.murn...@gmail.com wrote:
  On Wed, Sep 9, 2009 at 21:29, Bill Sommerfeld sommerf...@sun.com wrote:
  Any suggestions?
 
  Let it run for another day.
  I'll let it keep running as long as it wants this time.
  scrub: scrub completed after 42h32m with 0 errors on Thu Sep 10 17:20:19 2009
 
 And the people rejoiced.  So I guess the issue is more scrubs may
 report ETA very inaccurately than scrubs never finish.  Thanks for
 the suggestions and support.

One of my pools routinely does this -- the scrub gets to 100% after
about 50 hours but keeps going for another day or more after that.

It turns out that zpool reports number of blocks visited vs number of
blocks allocated, but clamps the ratio at 100%.

If there is substantial turnover in the pool, it appears you may end up
needing to visit more blocks than are actually allocated at any one
point in time.

I made a modified version of the zpool command and this is what it
prints for me:

...
 scrub: scrub in progress for 74h25m, 119.90% done, 0h0m to go
 5428197411840 blocks examined, 4527262118912 blocks allocated
...

This is the (trivial) source change I made to see what's going on under
the covers:

diff -r 12fb4fb507d6 usr/src/cmd/zpool/zpool_main.c
--- a/usr/src/cmd/zpool/zpool_main.c    Mon Oct 26 22:25:39 2009 -0700
+++ b/usr/src/cmd/zpool/zpool_main.c    Tue Nov 10 17:07:59 2009 -0500
@@ -2941,12 +2941,15 @@
 
         if (examined == 0)
                 examined = 1;
-        if (examined > total)
-                total = examined;
 
         fraction_done = (double)examined / total;
-        minutes_left = (uint64_t)((now - start) *
-            (1 - fraction_done) / fraction_done / 60);
+        if (fraction_done < 1) {
+                minutes_left = (uint64_t)((now - start) *
+                    (1 - fraction_done) / fraction_done / 60);
+        } else {
+                minutes_left = 0;
+        }
+
         minutes_taken = (uint64_t)((now - start) / 60);
 
         (void) printf(gettext("%s in progress for %lluh%um, %.2f%% done, "
@@ -2954,6 +2957,9 @@
             scrub_type, (u_longlong_t)(minutes_taken / 60),
             (uint_t)(minutes_taken % 60), 100 * fraction_done,
             (u_longlong_t)(minutes_left / 60), (uint_t)(minutes_left % 60));
+        (void) printf(gettext("\t %lld blocks examined, %lld blocks allocated\n"),
+            examined,
+            total);
 }
 
 static void

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedupe question

2009-11-07 Thread Bill Sommerfeld
On Sat, 2009-11-07 at 17:41 -0500, Dennis Clarke wrote:
 Does the dedupe functionality happen at the file level or a lower block
 level?

it occurs at the block allocation level.

 I am writing a large number of files that have the fol structure :
 
 -- file begins
 1024 lines of random ASCII chars 64 chars long
 some tilde chars .. about 1000 of then
 some text ( english ) for 2K
 more text ( english ) for 700 bytes or so
 --

ZFS's default block size is 128K and is controlled by the recordsize
filesystem property.  Unless you changed recordsize, each of the files
above would be a single block distinct from the others.

you may or may not get better dedup ratios with a smaller recordsize
depending on how the common parts of the file line up with block
boundaries.

the cost of additional indirect blocks might overwhelm the savings from
deduping a small common piece of the file.
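If you want to measure rather than guess, something like (names
illustrative):

zfs create -o recordsize=8k -o dedup=on mypool/testfs
(copy the files in)
zpool get dedupratio mypool

repeated at a couple of record sizes will tell you quickly.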

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] sched regularily writing a lots of MBs to the pool?

2009-11-04 Thread Bill Sommerfeld
zfs groups writes together into transaction groups; the physical writes
to disk are generally initiated by kernel threads (which appear in
dtrace as threads of the sched process).  Changing the attribution is
not going to be simple, as a single physical write to the pool may
contain data and metadata changes triggered by multiple user processes.

You need to go up a level of abstraction and look at the vnode layer to
attribute writes to particular processes.
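A couple of quick-and-dirty dtrace one-liners for that (not a polished
tool, and the probe names assume current bits):

# dtrace -n 'fbt::fop_write:entry { @[execname] = count(); }'
# dtrace -n 'syscall::write:entry { @[execname] = sum(arg2); }'

The first counts vnode-layer writes per process; the second tallies
requested bytes per process at the system call layer.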



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilvering, amount of data on disk, etc.

2009-10-26 Thread Bill Sommerfeld
On Mon, 2009-10-26 at 10:24 -0700, Brian wrote:
 Why does resilvering an entire disk, yield different amounts of data that was 
 resilvered each time.
 I have read that ZFS only resilvers what it needs to, but in the case of 
 replacing an entire disk with another formatted clean disk, you would think 
 the amount of data would be the same each time a disk is replaced with an 
 empty formatted disk. 
 I'm getting different results when viewing the 'zpool status' info (below)

replacing a disk adds an entry to the zpool history log, which
requires allocating blocks, which will change what's stored in the pool.
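You can see that log with:

zpool history yourpool

every command recorded there consumes a little space in the pool.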


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread Bill Sommerfeld

On Fri, 2009-09-25 at 14:39 -0600, Lori Alt wrote:
 The list of datasets in a root pool should look something like this:
...
 rpool/swap  

I've had success with putting swap into other pools.  I believe others
have, as well.
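Roughly (pool name and size are illustrative):

zfs create -V 4g otherpool/swap
swap -a /dev/zvol/dsk/otherpool/swap

plus a matching vfstab entry if you want it to persist across reboots.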

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ versus mirrroed

2009-09-18 Thread Bill Sommerfeld

On Wed, 2009-09-16 at 14:19 -0700, Richard Elling wrote:
 Actually, I had a ton of data on resilvering which shows mirrors and
 raidz equivalently bottlenecked on the media write bandwidth. However,
 there are other cases which are IOPS bound (or CR bound :-) which
 cover some of the postings here. I think Sommerfeld has some other
 data which could be pertinent.

I'm not sure I have data, but I have anecdotes and observations, and a
few large production pools used for solaris development by me and my
coworkers.

the biggest one (by disk count) takes 80-100 hours to scrub and/or
resilver.

my working hypothesis is that pools which:
 1) have a lot of files, directories, filesystems, and periodic
snapshots
 2) have atime updates enabled (default config)
 3) have regular (daily) jobs doing large-scale filesystem tree-walks

wind up rewriting most blocks of the dnode files on every tree walk
doing atime updates, and as a result the dnode file (but not most of the
blocks it points to) differs greatly from daily snapshot to daily
snapshot.

as a result, scrub/resilver traversals end up spending most of their 
time doing random reads of the dnode files of each snapshot.

here are some bugs that, if fixed, might help:

6678033 resilver code should prefetch
6730737 investigate colocating directory dnodes

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] This is the scrub that never ends...

2009-09-09 Thread Bill Sommerfeld

On Wed, 2009-09-09 at 21:30 +, Will Murnane wrote:
 Some hours later, here I am again:
  scrub: scrub in progress for 18h24m, 100.00% done, 0h0m to go
 Any suggestions?

Let it run for another day.  

A pool on a build server I manage takes about 75-100 hours to scrub, but
typically starts reporting 100.00% done, 0h0m to go at about the 50-60
hour point.  

I suspect the combination of frequent time-based snapshots and a pretty
active set of users causes the progress estimate to be off..

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs kernel compilation issue

2009-08-29 Thread Bill Sommerfeld

On Fri, 2009-08-28 at 23:12 -0700, P. Anil Kumar wrote:
 I would like to know why its picking up amd64 config params from the 
 Makefile, while uname -a clearly shows that its i386 ?

it's behaving as designed.

on solaris, uname -a always shows i386 regardless of whether the system
is in 32-bit or 64-bit mode.  you can use the isainfo command to tell if
amd64 is available.

on i386, we always build both 32-bit and 64-bit kernel modules; the
bootloader will figure out which kernel to load.
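For example, on a 64-bit-capable x86 box:

$ isainfo -k
amd64
$ isainfo -b
64

a kernel booted in 32-bit mode would report i386 and 32 instead.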

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] avail drops to 32.1T from 40.8T after create -o mountpoint

2009-07-30 Thread Bill Sommerfeld
On Wed, 2009-07-29 at 06:50 -0700, Glen Gunselman wrote:
 There was a time when manufacturers know about base-2 but those days  
 are long gone.

Oh, they know all about base-2; it's just that disks seem bigger when
you use base-10 units.

Measure a disk's size in 10^(3n)-based KB/MB/GB/TB units, and you get a
bigger number than its size in the natural-for-software 2^(10n)-sized
units.
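For example, a "1 TB" drive in marketing units holds 10^12 bytes, which
is only about 931 GiB (10^12 / 2^30), or roughly 0.91 TiB -- about a 9%
haircut before you account for labels, slices, or metadata.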

So it's obvious which numbers end up on the marketing glossies, and it's
all downhill from there...

- Bill


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-22 Thread Bill Sommerfeld
On Mon, 2009-06-22 at 06:06 -0700, Richard Elling wrote:
 Nevertheless, in my lab testing, I was not able to create a random-enough
 workload to not be write limited on the reconstructing drive.  Anecdotal
 evidence shows that some systems are limited by the random reads.

Systems I've run which have random-read-limited reconstruction have a
combination of:
 - regular time-based snapshots
 - daily cron jobs which walk the filesystem, accessing all directories
and updating all directory atimes in the process.

Because the directory dnodes are randomly distributed through the dnode
file, each block of the dnode file likely contains at least one
directory dnode, and as a result each of the tree walk jobs causes the
entire dnode file to diverge from the previous day's snapshot.

If the underlying filesystems are mostly static and there are dozens of
snapshots, a pool traverse spends most of its time reading the dnode
files and finding block pointers to older blocks which it knows it has
already seen.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compression at zfs filesystem creation

2009-06-19 Thread Bill Sommerfeld
On Wed, 2009-06-17 at 12:35 +0200, casper@sun.com wrote:
 I still use disk swap because I have some bad experiences 
 with ZFS swap.  (ZFS appears to cache and that is very wrong)

I'm experimenting with running zfs swap with the primarycache attribute
set to metadata instead of the default all.  

aka: 

zfs set primarycache=metadata rpool/swap 

seems like that would be more likely to behave appropriately.

- Bill



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] schedulers [was: zfs related google summer of code ideas - your vote]

2009-03-04 Thread Bill Sommerfeld
On Wed, 2009-03-04 at 12:49 -0800, Richard Elling wrote:
 But I'm curious as to why you would want to put both the slog and
 L2ARC on the same SSD?

Reducing part count in a small system.

For instance: adding L2ARC+slog to a laptop.  I might only have one slot
free to allocate to an SSD.

IMHO the right administrative interface for this is for zpool to allow
you to add the same device to a pool as both cache and log, and let zfs
figure out how to not step on itself when allocating blocks.
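In the meantime the workaround is to carve the SSD up yourself, e.g.
(device names illustrative):

zpool add mypool log c5t0d0s0
zpool add mypool cache c5t0d0s1

with a small slice for the slog and the remainder for L2ARC.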

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-12 Thread Bill Sommerfeld

On Thu, 2009-02-12 at 17:35 -0500, Blake wrote:
 That does look like the issue being discussed.
 
 It's a little alarming that the bug was reported against snv54 and is
 still not fixed :(

bugs.opensolaris.org's information about this bug is out of date.

It was fixed in snv_54:

changeset:   3169:1dea14abfe17
user:phitran
date:Sat Nov 25 11:05:17 2006 -0800
files:   usr/src/uts/common/io/scsi/targets/sd.c

6424510 usb ignores DKIOCFLUSHWRITECACHE

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problems at 90% zpool capacity 2008.05

2009-01-07 Thread Bill Sommerfeld

On Tue, 2009-01-06 at 22:18 -0700, Neil Perrin wrote:
 I vaguely remember a time when UFS had limits to prevent
 ordinary users from consuming past a certain limit, allowing
 only the super-user to use it. Not that I'm advocating that
 approach for ZFS.

looks to me like zfs already provides a mechanism for this (quotas and
reservations); it's up to the sysadmin to decide on policy.

Don't want the last 10% of the pool used?  Create a ballast zvol or
filesystem with a big reservation, and don't put anything in it..
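For example, to hold back roughly a tenth of a 10T pool (names and size
illustrative):

zfs create -o reservation=1t tank/ballast

and release it later, if you must, with zfs set reservation=none
tank/ballast.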

Of course, some degree of experimentation may be necessary before you
figure out what policy makes sense for your system or site.

- Bill


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Setting per-file record size / querying fs/file record size?

2008-10-22 Thread Bill Sommerfeld
On Wed, 2008-10-22 at 10:30 +0100, Darren J Moffat wrote:
 I'm assuming this is local filesystem rather than ZFS backed NFS (which 
 is what I have).

Correct, on a laptop.

 What has setting the 32KB recordsize done for the rest of your home
 dir, or did you give the evolution directory its own dataset ?

The latter, though it occurs to me that I could set the recordsize back
up to 128K once the databases (one per mail account) are created -- the
recordsize dataset property is consulted only at file create time, when
the file's recordsize is set.  (Having a new interface to set the file's
recordsize directly at create time would bypass this sort of gyration).

(Apparently the sqlite file format uses 16-bit within-page offsets; 32kb
is its current maximum page size and 64k may be as large as it can go
without significant renovations..)

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disabling COMMIT at NFS level, or disabling ZIL on a per-filesystem basis

2008-10-22 Thread Bill Sommerfeld
On Wed, 2008-10-22 at 10:45 -0600, Neil Perrin wrote:
 Yes: 6280630 zil synchronicity
 
 Though personally I've been unhappy with the exposure that zil_disable has 
 got.
 It was originally meant for debug purposes only. So providing an official
 way to make synchronous behaviour asynchronous is to me dangerous.

It seems far more dangerous to only provide a global knob instead of a
local knob.

I want it in conjunction with bulk operations (like an ON nightly
build, database reloads, etc.) where the response to a partial failure
will be to rm -rf and start over.  Any time spent waiting for
intermediate states of the filesystem to be committed to stable store is
wasted time.

 Once Admins start to disable the ZIL for whole pools because the extra
 performance is too tempting, wouldn't it be the lesser evil to let them
 disable it on a per filesystem basis?

Agreed.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tool to figure out optimum ZFS recordsize for a Mail server Maildir tree?

2008-10-22 Thread Bill Sommerfeld
On Wed, 2008-10-22 at 09:46 -0700, Mika Borner wrote:
 If I turn zfs compression on, does the recordsize influence the
 compressratio in anyway?

zfs conceptually chops the data into recordsize chunks, then compresses
each chunk independently, allocating on disk only the space needed to
store each compressed block.

On average, I'd expect to get a better compression ratio with a larger
block size since typical compression algorithms will have more chance to
find redundancy in a larger block of text.

as always your mileage may vary.
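It's easy enough to measure on a sample of real mail (dataset names
illustrative):

zfs create -o recordsize=8k -o compression=on tank/t8k
zfs create -o recordsize=128k -o compression=on tank/t128k
(copy the same Maildir into each)
zfs get compressratio tank/t8k tank/t128k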

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Setting per-file record size / querying fs/file record size?

2008-10-21 Thread Bill Sommerfeld
On Mon, 2008-10-20 at 16:57 -0500, Nicolas Williams wrote:
 I've a report that the mismatch between SQLite3's default block size and
 ZFS' causes some performance problems for Thunderbird users.

I was seeing a severe performance problem with sqlite3 databases as used
by evolution (not thunderbird).

It appears that reformatting the evolution databases to a 32KB database
page size and setting zfs's record size to a matching 32KB has done
wonders for evolution performance to a ZFS home directory.
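One way to do that sort of reformat (file and dataset names are
illustrative; back the databases up first): bump the page size with

sqlite3 folders.db 'PRAGMA page_size=32768; VACUUM;'

then set recordsize=32k on the dataset and copy the database files to
new files so they pick up the new record size (recordsize only applies
to files created after the property is set).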

 It'd be great if there was an API by which SQLite3 could set its block
 size to match the hosting filesystem or where it could set the DB file's
 record size to match the SQLite3/app default block size (1KB).

IMHO some of the fix has to involve sqlite3 using a larger page size by
default when creating the database -- it seems to be a lot more
efficient with the larger page size.

Databases like sqlite3 are being used under the covers by growing
numbers of applications -- it seems like there's a missing interface
here if we want decent out-of-the-box performance of end-user apps like
tbird and evolution using databases on zfs.

- Bill
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Bill Sommerfeld
On Wed, 2008-10-01 at 11:54 -0600, Robert Thurlow wrote:
  like they are not good enough though, because unless this broken
  router that Robert and Darren saw was doing NAT, yeah, it should not
  have touch the TCP/UDP checksum.

NAT was not involved.

 I believe we proved that the problem bit flips were such
 that the TCP checksum was the same, so the original checksum
 still appeared correct.

That's correct.   

The pattern we found in corrupted data was that there would be two
offsetting bit-flips.  

A 0->1 flip was followed 256, 512, or 1024 bytes later by a 1->0 flip,
or vice-versa.  (It was always the same bit; in the cases I analyzed,
the corrupted files contained C source code and the bit-flips were
obvious.)  Under the 16-bit one's-complement checksum used by TCP, these
two changes cancel each other out and the resulting packet has the same
checksum.
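(To spell it out: the offsets are even, so both flips land in the same
bit position of their respective 16-bit words; the 0->1 flip adds 2^k to
the one's-complement sum and the matching 1->0 flip subtracts the same
2^k, leaving the sum -- and hence the checksum -- unchanged.)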

  BTW which router was it, or you
  can't say because you're in the US? :)
 
 I can't remember; it was aging at the time.

to use excruciatingly precise terminology, I believe the switch in
question is marketed as a combo L2 bridge/L3 router but in this case may
have been acting as a bridge rather than a router. 

After we noticed the data corruption we looked at TCP counters on hosts
on that subnet and noticed a high rate of failed checksums, so clearly
the TCP checksum was catching *most* of the corrupted packets; we just
didn't look at the counters until after we saw data corruption.

- Bill









___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver speed.

2008-09-05 Thread Bill Sommerfeld
On Fri, 2008-09-05 at 09:41 -0700, Richard Elling wrote:
  Also does the resilver deliberately pause?  Running iostat I see
 that it will pause for five to ten seconds where no IO is done at all,
 then it continues on at a more reasonable pace.

 I have not seen such behaviour during resilver characterization.

I have, post nv_94, and I filed a bug:

6729696 sync causes scrub or resilver to pause for up to 30s


- Bill


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-09-02 Thread Bill Sommerfeld
On Sun, 2008-08-31 at 12:00 -0700, Richard Elling wrote:
 2. The algorithm *must* be computationally efficient.
We are looking down the tunnel at I/O systems that can
deliver on the order of 5 Million iops.  We really won't
have many (any?) spare cycles to play with.

If you pick the constants carefully (powers of two) you can do the TCP
RTT + variance estimation using only a handful of shifts, adds, and
subtracts.

 In both of these cases, the solutions imply multi-minute timeouts are
 required to maintain a stable system.  

Again, there are different uses for timeouts:
 1) how long should we wait on an ordinary request before deciding to
try plan B and go elsewhere (a la B_FAILFAST)
 2) how long should we wait (while trying all alternatives) before
declaring an overall failure and giving up.

The RTT estimation approach is really only suitable for the former,
where you have some alternatives available (retransmission in the case
of TCP; trying another disk in the case of mirrors, etc.).

when you've tried all the alternatives and nobody's responding, there's
no substitute for just retrying for a long time.

- Bill


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-09-02 Thread Bill Sommerfeld
On Sun, 2008-08-31 at 15:03 -0400, Miles Nordin wrote:

 It's sort of like network QoS, but not quite, because: 
 
   (a) you don't know exactly how big the ``pipe'' is, only
   approximately, 

In an ip network, end nodes generally know no more than the pipe size of
the first hop -- and in some cases (such as true CSMA networks like
classical ethernet or wireless) only have an upper bound on the pipe
size.  

beyond that, they can only estimate the characteristics of the rest of
the network by observing its behavior - all they get is end-to-end
latency, and *maybe* a 'congestion observed' mark set by an intermediate
system.

   (c) all the fabrics are lossless, so while there are queues which
   undesireably fill up during congestion, these queues never drop
   ``packets'' but instead exert back-pressure all the way up to
   the top of the stack.

hmm.  I don't think the back pressure makes it all the way up to zfs
(the top of the block storage stack) except as added latency.  

(on the other hand, if it did, zfs could schedule around it both for
reads and writes, avoiding pouring more work on already-congested
paths..)

 I'm surprised we survive as well as we do without disk QoS.  Are the
 storage vendors already doing it somehow?

I bet that (as with networking) in many/most cases overprovisioning the
hardware and running at lower average utilization is often cheaper in
practice than running close to the edge and spending a lot of expensive
expert time monitoring performance and tweaking QoS parameters.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Bill Sommerfeld
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote:
 A better option would be to not use this to perform FMA diagnosis, but
 instead work into the mirror child selection code.  This has already
 been alluded to before, but it would be cool to keep track of latency
 over time, and use this to both a) prefer one drive over another when
 selecting the child and b) proactively timeout/ignore results from one
 child and select the other if it's taking longer than some historical
 standard deviation.  This keeps away from diagnosing drives as faulty,
 but does allow ZFS to make better choices and maintain response times.
 It shouldn't be hard to keep track of the average and/or standard
 deviation and use it for selection; proactively timing out the slow I/Os
 is much trickier.

tcp has to solve essentially the same problem: decide when a response is
overdue based only on the timing of recent successful exchanges in a
context where it's difficult to make assumptions about reasonable
expected behavior of the underlying network.

it tracks both the smoothed round trip time and the variance, and
declares a response overdue after (SRTT + K * variance).

I think you'd probably do well to start with something similar to what's
described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on
experience.
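For reference, the RFC 2988 rules boil down to (per measured round-trip
time R):

SRTT   = 7/8 * SRTT + 1/8 * R
RTTVAR = 3/4 * RTTVAR + 1/4 * |SRTT - R|
RTO    = SRTT + 4 * RTTVAR

which reduces to shifts, adds and subtracts as long as the constants
stay powers of two.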

- Bill





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best layout for 15 disks?

2008-08-22 Thread Bill Sommerfeld
On Thu, 2008-08-21 at 21:15 -0700, mike wrote:
 I've seen 5-6 disk zpools are the most recommended setup.

This is incorrect.

Much larger zpools built out of striped redundant vdevs (mirror, raidz1,
raidz2) are recommended and also work well.

raidz1 or raidz2 vdevs of more than a single-digit number of drives are
not recommended.

so, for instance, the following is an appropriate use of 12 drives in
two raidz2 sets of 6 disks, with 8 disks worth of raw space available:

zpool create mypool raidz2 disk0 disk1 disk2 disk3 disk4 disk5
zpool add mypool raidz2 disk6 disk7 disk8 disk9 disk10 disk11

 In traditional RAID terms, I would like to do RAID5 + hot spare (13
 disks usable) out of the 15 disks (like raidz2 I suppose). What would
 make the most sense to setup 15 disks with ~ 13 disks of usable space?

Enable compression, and set up multiple raidz2 groups.  Depending on
what you're storing, you may get back more than you lose to parity.
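For 15 disks that could look like (disk names illustrative):

zpool create mypool raidz2 disk0 disk1 disk2 disk3 disk4 disk5 disk6
zpool add mypool raidz2 disk7 disk8 disk9 disk10 disk11 disk12 disk13
zpool add mypool spare disk14
zfs set compression=on mypool

two 7-disk raidz2 sets plus a shared hot spare, giving 10 disks of raw
space before compression.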

  This is for a home fileserver, I do not need HA/hotplugging/etc. so I
 can tolerate a failure and replace it with plenty of time. It's not
 mission critical.

That's a lot of spindles for a home fileserver.   I'd be inclined to go
with a smaller number of larger disks in mirror pairs, allowing me to
buy larger disks in pairs as they come on the market to increase
capacity.

 Same question, but 10 disks, and I'd sacrifice one for parity then.
 Not two. so ~9 disks usable roughly (like raidz)

zpool create mypool raidz1 disk0 disk1 disk2 disk3 disk4
zpool add mypool raidz1 disk5 disk6 disk7 disk8 disk9

8 disks raw capacity, can survive the loss of any one disk or the loss
of two disks in different raidz groups.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] more ZFS recovery

2008-08-07 Thread Bill Sommerfeld
On Thu, 2008-08-07 at 11:34 -0700, Richard Elling wrote:
 How would you describe the difference between the data recovery
 utility and ZFS's normal data recovery process?

I'm not Anton but I think I see what he's getting at.

Assume you have disks which once contained a pool but all of the
uberblocks have been clobbered.  So you don't know where the root of the
block tree is, but all the actual data is there, intact, on the disks.  

Given the checksums you could rebuild one or more plausible structures of
the pool from the bottom up.

I'd think that you could construct an offline zpool data recovery tool
where you'd start with N disk images and a large amount of extra working
space, compute checksums of all possible data blocks on the images, scan
the disk images looking for things that might be valid block pointers,
and attempt to stitch together subtrees of the filesystem and recover as
much as you can even if many upper nodes in the block tree have had
holes shot in them by a miscreant device.

- Bill





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Checksum error: which of my files have failed scrubbing?

2008-08-05 Thread Bill Sommerfeld
On Tue, 2008-08-05 at 12:11 -0700, soren wrote:
  soren wrote:
   ZFS has detected that my root filesystem has a
  small number of errors.  Is there a way to tell which
  specific files have been corrupted?
 
  After a scrub a zpool status -v should give you a
  list of files with 
  unrecoverable errors.
 
 Hmm, I just tried that.  Perhaps No known data errors means that my files 
 are OK.  In that case I wonder what the checksum failure was from.

If this is build 94 and you have one or more unmounted filesystems, 
(such as alternate boot environments), these errors are false positives.
There is no actual error; the scrubber misinterpreted the end of an
intent log block chain as a checksum error.

the bug id is:

6727872 zpool scrub: reports checksum errors for pool with zfs and
unplayed ZIL

This bug is fixed in build 95.  One workaround is to mount the
filesystems and then unmount them to apply the intent log changes.
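i.e. something like (dataset and pool names illustrative):

zfs mount rpool/ROOT/other-be
zfs unmount rpool/ROOT/other-be
zpool clear rpool
zpool scrub rpool

after which a fresh scrub should come up clean.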

- Bill




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Block unification in ZFS

2008-08-05 Thread Bill Sommerfeld
See the long thread titled "ZFS deduplication", last active
approximately 2 weeks ago.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I trust ZFS?

2008-08-03 Thread Bill Sommerfeld
On Sun, 2008-08-03 at 11:42 -0500, Bob Friesenhahn wrote:
 Zfs makes human error really easy.  For example
 
$ zpool destroy mypool

Note that zpool destroy can be undone by zpool import -D (if you get
to it before the disks are overwritten).

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-20 Thread Bill Sommerfeld
On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
  I ran a scrub on a root pool after upgrading to snv_94, and got checksum 
  errors:
 
 Hmm, after reading this, I started a zpool scrub on my mirrored pool, 
 on a system that is running post snv_94 bits:  It also found checksum errors
 
 # zpool status files
   pool: files
  state: DEGRADED
 status: One or more devices has experienced an unrecoverable error.  An
   attempt was made to correct the error.  Applications are unaffected.
 action: Determine if the device needs to be replaced, and clear the errors
   using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
  scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008
 config:
 
    NAME          STATE     READ WRITE CKSUM
    files         DEGRADED     0     0    18
      mirror      DEGRADED     0     0    18
        c8t0d0s6  DEGRADED     0     0    36  too many errors
        c9t0d0s6  DEGRADED     0     0    36  too many errors
 
 errors: No known data errors

out of curiosity, is this a root pool?  

A second system of mine with a mirrored root pool (and an additional
large multi-raidz pool) shows the same symptoms on the mirrored root
pool only.

once is accident.  twice is coincidence.  three times is enemy
action :-)

I'll file a bug as soon as I can (I'm travelling at the moment with
spotty connectivity), citing my and your reports.

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-17 Thread Bill Sommerfeld
I ran a scrub on a root pool after upgrading to snv_94, and got checksum
errors:

  pool: r00t
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are
unaffected.
action: Determine if the device needs to be replaced, and clear the
errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h26m with 1 errors on Thu Jul 17 14:52:14
2008
config:

NAME  STATE READ WRITE CKSUM
r00t  ONLINE   0 0 2
  mirror  ONLINE   0 0 2
c4t0d0s0  ONLINE   0 0 4
c4t1d0s0  ONLINE   0 0 4

I ran it again, and it's now reporting the same errors, but still says
applications are unaffected:

  pool: r00t
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h27m with 2 errors on Thu Jul 17 20:24:15 2008
config:

NAME  STATE READ WRITE CKSUM
r00t  ONLINE   0 0 4
  mirror  ONLINE   0 0 4
c4t0d0s0  ONLINE   0 0 8
c4t1d0s0  ONLINE   0 0 8

errors: No known data errors


I wonder if I'm running into some combination of:

6725341 Running 'zpool scrub' repeatedly on a pool show an ever
increasing error count

and maybe:

6437568 ditto block repair is incorrectly propagated to root vdev

Any way to dig further to determine what's going on?
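(One place I intend to look is the FMA error telemetry -- fmdump -eV
shows the underlying zfs checksum ereports, including which vdev and
offset they were reported against.)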

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] J4500 device renumbering

2008-07-15 Thread Bill Sommerfeld
On Tue, 2008-07-15 at 15:32 -0500, Bob Friesenhahn wrote:
 On Tue, 15 Jul 2008, Ross Smith wrote:
 
 
  It sounds like you might be interested to read up on Eric Schrock's work.  
  I read today about some of the stuff he's been doing to bring integrated 
  fault management to Solaris:
  http://blogs.sun.com/eschrock/entry/external_storage_enclosures_in_solaris
  His last paragraph is great to see, Sun really do seem to be headed in the 
  right direction:
 
 That does sound good.  It seems like this effort is initially limited 
 to SAS enclosures.

It seems to get some info from a SE3510 jbod (fiberchannel), but doesn't
identify which disk is in each drive slot:

# /usr/lib/fm/fmd/fmtopo -V '*/ses-enclosure=0/bay=0'
TIME UUID
Jul 15 17:33:37 6033e234-94a3-ca79-9138-af1ee7f95b8d

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=0
  group: protocol   version: 1   stability: Private/Private
    resource          fmri
hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=0
    label             string    Disk Drives 0
    FRU               fmri
hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=0
  group: authority  version: 1   stability: Private/Private
    product-id        string    SUN-StorEdge-3510F-D
    chassis-id        string    205000c0ff086b4a
    server-id         string
  group: ses        version: 1   stability: Private/Private
    node-id           uint64    0x3
    target-path       string    /dev/es/ses0

# /usr/lib/fm/fmd/fmtopo '*/ses-enclosure=0/*'
TIME UUID
Jul 15 17:35:23 16ff7d01-7f1d-e8ef-f8a5-d60a01d99b68

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/psu=0

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/psu=1

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/fan=0

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/fan=1

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/fan=2

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/fan=3

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=0

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=1

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=2

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=3

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=4

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=5

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=6

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=7

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=8

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=9

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=10

hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=11


- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [caiman-discuss] swap dump on ZFS volume

2008-06-24 Thread Bill Sommerfeld
On Tue, 2008-06-24 at 09:41 -0700, Richard Elling wrote:
 IMHO, you can make dump optional, with no dump being default. 
 Before Sommerfeld pounces on me (again :-))

actually, in the case of virtual machines, doing the dump *in* the
virtual machine into preallocated virtual disk blocks is silly.  if you
can break the abstraction barriers a little, I'd think it would make
more sense for the virtual machine infrastructure to create some sort of
snapshot at the time of failure which could then be converted into a
form that mdb can digest...

- Bill






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Growing root pool ?

2008-06-11 Thread Bill Sommerfeld

On Wed, 2008-06-11 at 07:40 -0700, Richard L. Hamilton wrote:
  I'm not even trying to stripe it across multiple
  disks, I just want to add another partition (from the
  same physical disk) to the root pool.  Perhaps that
  is a distinction without a difference, but my goal is
  to grow my root pool, not stripe it across disks or
  enable raid features (for now).
  
  Currently, my root pool is using c1t0d0s4 and I want
  to add c1t0d0s0 to the pool, but can't.
  
  -Wyllys
 
 Right, that's how it is right now (which the other guy seemed to
 be suggesting might change eventually, but nobody knows when
 because it's just not that important compared to other things).
 
 AFAIK, if you could shrink the partition whose data is after
 c1t0d0s4 on the disk, you could grow c1t0d0s4 by that much,
 and I _think_ zfs would pick up the growth of the device automatically.

This works.  ZFS doesn't notice the size increase until you reboot.

I've been installing systems over the past year with a slice arrangement
intended to make it easy to go to zfs root:

s0 with a ZFS pool at start of  disk
s1 swap
s3 UFS boot environment #1
s4 UFS boot environment #2
s7 SVM metadb (if mirrored root)

I was happy to discover that this paid off.  Once I upgraded a BE to
nv_90 and was running on it, it was a matter of:

lucreate -p $pool -n nv_90zfs
luactivate nv_90zfs

init 6  (reboot)

ludelete other BE's

format
format partition
delete slices other than s0
grow s0 to full disk

reboot

and you're all ZFS all the time.

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] disk names?

2008-06-05 Thread Bill Sommerfeld
On Wed, 2008-06-04 at 23:12 +, A Darren Dunham wrote:
 Best story I've heard is that it dates from before the time when
 modifiable (or at least *easily* modifiable) slices didn't exist.  No
 hopping into 'format' or using 'fmthard'.  Instead, your disk came with
 an entry in 'format.dat' with several fixed slices.

format.dat?  bah.  in some systems I used - notably 4.2/4.3BSD on the
vax and some even more obscure hardware - the partition table was
*compiled into the device driver* (one table per known disk type). 

Don't like the partition layout?  you have kernel source, you can change
it...

Disk labels didn't turn up until after BSD4.3.

 So you could use the entire disk with any of:
 a,b,d,e,f,g
 a,b,d,e,h
 c

Right.  You'd typically use the a/b/d/e/f/g or a/b/d/e/h slice on your
boot disk and the c slice on additional disks.

 without having to change the label.

And the reason why changing the label was avoided was because it
required recompiling the kernel and rebooting.

 I speculate that then utilities were written that used c/2 for
 information about the entire disk and people thought keeping the
 convention going was good.

it's more like it was too painful to change.

 You can later use access to block 0 (via any slice) to corrupt (...er
 *modify*) that label, but that's not a feature of s2.  s0 would do it as
 well with the way most disks are labled (because it also contains
 cylinder 0/block 0.)

and why didn't this get fixed?  inertia.  because slices are implemented
in the disk driver by looking at the low order bits of the disk minor
number, you couldn't just wedge in an additional device instance for the
unsliced disk without taking away one slice or re-creating *all* of your
disk block and character devices.

- Bill


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS root compressed ?

2008-06-05 Thread Bill Sommerfeld

On Thu, 2008-06-05 at 23:04 +0300, Cyril Plisko wrote:
 1. Are there any reasons to *not* enable compression by default ?

Not exactly an answer:

Most of the systems I'm running today on ZFS root have compression=on
and copies=2 for rpool/ROOT 

 2. How can I do it ? (I think I can run zfs set compression=on
 rpool/ROOT/snv_90 in the other window, right after the installation
 begins, but I would like less hacky way.)

what I did was to migrate via live upgrade, creating the pool and the
pool/ROOT filesystem myself, tweaking both  copies and compression on
pool/ROOT before using lucreate.
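Roughly (disk and BE names illustrative):

zpool create rpool c1t0d0s0
zfs create rpool/ROOT
zfs set compression=on rpool/ROOT
zfs set copies=2 rpool/ROOT
lucreate -p rpool -n snv_90zfs

boot environments created under rpool/ROOT then inherit both properties.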

I haven't tried this on a fresh install yet.

after install, I'd think you could play games with zfs send | zfs
receive on an inactive BE to rewrite everything with the desired
attributes (more important for copies than compression).

- Bill


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is a vdev?

2008-05-23 Thread Bill Sommerfeld

On Fri, 2008-05-23 at 13:45 -0700, Orvar Korvar wrote:
 Ok, so i make one vdev out of 8 discs. And I combine all vdevs into one large 
 zpool? Is it correct?
 
 I have 8 port SATA card. I have 4 drives into one zpool.

zpool create mypool raidz1 disk0 disk1 disk2 disk3

you have a pool consisting of one vdev made up of 4 disks.

  That is one vdev, right? Now I can add 4 new drives and make them
 into one zpool.

you could do that and keep the pool separate, or you could add them as a
single vdev to the existing pool:

zpool add mypool raidz1 disk4 disk5 disk6 disk7

- Bill


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS ACLs/Samba integration

2008-03-17 Thread Bill Sommerfeld

On Fri, 2008-03-14 at 18:11 -0600, Mark Shellenbaum wrote:
  I think it is a misnomer to call the current
  implementation of ZFS a pure ACL system, as clearly the ACLs are heavily
  contaminated by legacy mode bits. 
 
 Feel free to open an RFE.  It may be a tough sell with PSARC, but maybe 
 if we have enough customer requests maybe they can be won over.

It is always wrong to have a mental model of PSARC as a monolithic
entity.  

I suspect at least some of the membership would be interested in this
sort of extension and it shouldn't be that hard to sell if it's not
the default behavior and it's clearly documented that turning it on
(probably on a fs-by-fs basis like every other zfs tunable) takes you
out of POSIX land.

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can ZFS be event-driven or not?

2008-02-28 Thread Bill Sommerfeld

On Wed, 2008-02-27 at 13:43 -0500, Kyle McDonald wrote:
 How was it MVFS could do this without any changes to the shells or any 
 other programs?
 
 I ClearCase could  'grep FOO /dir1/dir2/file@@/main/*' to see which 
 version of 'file' added FOO.
 (I think @@ was the special hidden key. It might have been something 
 else though.)

When I last used clearcase (on the order of 12 years ago) foo@@/ only
worked within clearcase mvfs filesystems.

It behaved as if the filesystem created a foo@@ virtual directory for
each real foo directory entry, but then filtered those names out of
directory listings.

Doing the same as an alternate view on snapshot space would be a
simple matter of programming within ZFS, though the magic token/suffix
to get you into version/snapshot space would likely not be POSIX
compliant..
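(ZFS already does a limited version of this at each filesystem root:
snapshots show up under the hidden .zfs/snapshot directory, e.g.
ls ~/.zfs/snapshot/yesterday -- just not per-file the way @@ was.)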

- Bill




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] five megabytes per second with

2008-02-21 Thread Bill Sommerfeld

On Thu, 2008-02-21 at 11:06 -0800, John Tracy wrote:
  I've read that this behavior can be expected depending on how the LAG
 is setup, whether it divides hashes up the data on a per packet or per
 source/destination basis/or other options.

(this is a generic answer, not specific to zfs exported via iSCSI).

round-robin by packet often results in packet reordering which is often
toxic to tcp performance; tcp can misinterpret significant amounts of
reordering as a sign of packet loss due to congestion and slow down in
the name of congestion avoidance.

round-robin by flow (I'm oversimplifying, but in this context flow
generally means same addresses *and* ports) works fairly well if you
have that option and have enough different TCP connections; each
connection will go over a single path but with enough connections and
enough paths it will even out.

- Bill



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-code] /usr/bin and /usr/xpg4/bin differences

2007-12-18 Thread Bill Sommerfeld

On Sat, 2007-12-15 at 22:00 -0800, Sasidhar Kasturi wrote:
 If i want to make some modifications in the code.. Can i do it
 for /xpg4/bin commands or .. i should do it for /usr/bin commands?? 

If possible (if there's no inherent conflict with either the applicable
standards or existing practice) you should do it for both to minimize
the difference between the two variants of the commands.

I'm currently working with John Plocher to figure out why the opinion
for psarc 2005/683 (which sets precedent that divergence between command
variants should be minimized) hasn't been published, but there's a more
detailed explanation of the desired relationship
between /usr/bin, /usr/xpg4/bin, and /usr/xpg6/bin in that opinion.

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is the correct way to replace a good disk?

2007-11-02 Thread Bill Sommerfeld
On Fri, 2007-11-02 at 11:20 -0700, Chris Williams wrote:
 I have a 9-bay JBOD configured as a raidz2.  One of the disks, which
 is on-line and fine, needs to be swapped out and replaced.  I have
 been looking though the zfs admin guide and am confused on how I
 should go about swapping out.  I though I could put the disk off-line,
 remove it, put a new disk in, and put on-line.  Does this sound
 right?  

That sounds right.  You'll have improved availability if you have a
spare disk slot and can do zpool replace $pool $old $new, but offline
followed by a reconstruct-in-place via zpool replace $pool $disk also
works.
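i.e., for the reconstruct-in-place variant (pool and device names
illustrative):

zpool offline tank c1t5d0
(physically swap the drive)
zpool replace tank c1t5d0
zpool status -x

and let the resilver finish before touching anything else.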

- Bill


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

