Re: [zfs-discuss] Consolidating a huge stack of DVDs using ZFS dedup: automation?

2010-03-02 Thread Kjetil Torgrim Homme
valrh...@gmail.com valrh...@gmail.com writes:

 I have been using DVDs for small backups here and there for a decade
 now, and have a huge pile of several hundred. They have a lot of
 overlapping content, so I was thinking of feeding the entire stack
 into some sort of DVD autoloader, which would just read each disk, and
 write its contents to a ZFS filesystem with dedup enabled. [...] That
 would allow me to consolidate a few hundred CDs and DVDs onto probably
 a terabyte or so, which could then be kept conveniently on a hard
 drive and archived to tape.

a deduplicated copy would be inconvenient to keep on a hard disk or
tape: the deduplication only survives as long as the data stays a ZFS
filesystem or a ZFS send stream.  it's better to use a generic tool like
hardlink(1) to merge the duplicates, and then delete the surplus files
afterwards with

  find . -type f -links +1 -exec rm {} \;

(untested!  with -exec rm {} \; the files are removed one at a time, so
by the time find reaches the last remaining link its link count has
already dropped to 1 and it is spared.  using xargs or -exec rm {} +
batches the deletions after all the files have been examined, which will
wipe out *all* copies of your duplicate files, so don't do that!)

  http://linux.die.net/man/1/hardlink

perhaps this is more convenient:
  http://netdial.caribe.net/~adrian2/fdupes.html
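
a rough sketch of the workflow (illustrative only; the path is made up,
and check the options of your hardlink(1) and fdupes builds before
running anything destructive):

  # replace duplicate files under the archive with hardlinks
  hardlink -v /tank/dvd-archive
  # then remove all but one link in each group, one file at a time
  find /tank/dvd-archive -type f -links +1 -exec rm {} \;
  # or let fdupes find and delete duplicates interactively
  fdupes -r -d /tank/dvd-archive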

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Consolidating a huge stack of DVDs using ZFS dedup: automation?

2010-03-02 Thread Kjetil Torgrim Homme
Freddie Cash fjwc...@gmail.com writes:

 Kjetil Torgrim Homme kjeti...@linpro.no wrote:

 a deduplicated copy would be inconvenient to keep on a hard disk or
 tape: the deduplication only survives as long as the data stays a ZFS
 filesystem or a ZFS send stream.  it's better to use a generic tool
 like hardlink(1) to merge the duplicates, and then delete the surplus
 files afterwards with

 Why would it be inconvenient?  This is pretty much exactly what ZFS +
 dedupe is perfect for.

the duplication is not visible, so it's still a wilderness of duplicates
when you navigate the files.

 Since dedupe is pool-wide, you could create individual filesystems for
 each DVD.  Or use just 1 filesystem with sub-directories.  Or just one
 filesystem with snapshots after each DVD is copied over top.

 The data would be dedupe'd on write, so you would only have 1 copy of
 unique data.

for this application, I don't think the OP *wants* COW if he changes one
file.  he'll want the duplicates to be kept in sync, not diverging (in
contrast to storage for VMs, for instance).

with hardlinks, it is easier to identify duplicates and handle them
however you like.  if there is a reason for the duplicate access paths
to your data, you can keep them.  I would want to straighten the mess
out, though, rather than keep it intact as closely as possible.
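
for instance, listing every path that refers to the same file is a
one-liner (sketch, GNU find syntax, paths made up):

  # all directory entries pointing at the same inode as the given file
  find /tank/dvd-archive -samefile /tank/dvd-archive/movies/foo.iso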

 To save it to tape, just zfs send it, and save the stream file.

the zfs stream format is not recommended for archiving.

 ZFS dedupe would also work better than hardlinking files, as it works
 at the block layer, and will be able to dedupe partial files.

yes, but for the most part this will be negligible.  copies of growing
files, like log files, or perhaps your novel written as a stream of
consciousness, will benefit.  unrelated, partially identical files are
rare.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Kjetil Torgrim Homme
Paul B. Henson hen...@acm.org writes:

 On Tue, 2 Mar 2010, Kjetil Torgrim Homme wrote:

 no.  what happens when an NFS client without ACL support mounts your
 filesystem?  your security is blown wide open.  the filemode should
 reflect the *least* level of access.  if the filemode on its own allows
 more access, then you've lost.

 Say what?

 If you're using secure NFS, access control is handled on the server
 side.  If an NFS client that doesn't support ACL's mounts the
 filesystem, it will have whatever access the user is supposed to have,
 the lack of ACL support on the client is immaterial.

this is true for AUTH_SYS, too, sorry about the bad example.  but it
doesn't really affect my point.  if you just consider the filemode to be
the lower bound for access rights, aclmode=passthrough will not give you
any nasty surprises regardless of what clients do, *and* an ACL-ignorant
client will get the behaviour it needs and wants.  win-win!
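
a minimal sketch of that setup, assuming a dataset called tank/home:

  # let explicit ACL entries survive a chmod instead of being rewritten
  zfs set aclmode=passthrough tank/home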

 if your ACLs are completely specified and give proper access on their
 own, and you're using aclmode=passthrough, chmod -R 000 / will not
 harm your system.

 Actually, it will destroy the three special ACE's, user@, group@, and
 everyone@.  On the other hand, with a hypothetical aclmode=ignore or
 aclmode=deny, such a chmod would indeed not harm the system.

you're not using those, are you?  they are a direct mapping of the old
style permissions, so it would be pretty weird if they were allowed to
diverge.

 if you have rogue processes doing chmod a+rwx or other nonsense, you
 need to fix the rogue process, that's not an ACL problem or a problem
 with traditional Unix permissions.

 What I have are processes that don't know about ACL's. Are they
 broken? Not in and of themselves, they are simply incompatible with a
 security model they are unaware of.

you made that model.

 Why on earth would I want to go and try to make every single
 application in the world ACL aware/compatible instead of simply having
 a filesystem which I can configure to ignore any attempt to manipulate
 legacy permissions?

you don't have to.  just let the filemode express the least level of
access, as described above, and use the ACL to grant anything beyond
that; then it just works.

 not at all.  you just have to use them correctly.

 I think we're just not on the same page on this; while I am not saying
 I'm on the right page, it does seem you need to do a little more
 reading up on how ACL's work.

nice insult.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Kjetil Torgrim Homme
Paul B. Henson hen...@acm.org writes:

 Good :). I am certainly not wedded to my proposal, if some other
 solution is proposed that would meet my requirements, great. However,
 pretty much all of the advice has boiled down to either "ACL's are
 broken, don't use them" or "why would you want to do *that*?", which
 isn't particularly useful.

you haven't demonstrated why the current capabilities are insufficient
for your requirements.  it's a bit hard to offer advice for perceived
problems other than "reconsider your perception".

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-28 Thread Kjetil Torgrim Homme
Paul B. Henson hen...@acm.org writes:
 On Fri, 26 Feb 2010, David Dyer-Bennet wrote:
 I think of using ACLs to extend extra access beyond what the
 permission bits grant.  Are you talking about using them to prevent
 things that the permission bits appear to grant?  Because so long as
 they're only granting extended access, losing them can't expose
 anything.

 Consider the example of creating a file in a directory which has an
 inheritable ACL for new files:

why are you doing this?  it's inherently insecure to rely on ACL's to
restrict access.  do as David says and use ACL's to *grant* access.  if
needed, set permission on the file to 000 and use umask 777.

 drwx--s--x+  2 henson   csupomona   4 Feb 27 09:21 .
 owner@:rwxpdDaARWcC--:-di---:allow
 owner@:rwxpdDaARWcC--:--:allow
 group@:--x---a-R-c---:-di---:allow
 group@:--x---a-R-c---:--:allow
  everyone@:--x---a-R-c---:-di---:allow
  everyone@:--x---a-R-c---:--:allow
 owner@:rwxpdDaARWcC--:f-i---:allow
 group@:--:f-i---:allow
  everyone@:--:f-i---:allow

 When the ACL is respected, then regardless of the requested creation
 mode or the umask, new files will have the following ACL:

 -rw---+  1 henson   csupomona   0 Feb 27 09:26 foo
 owner@:rw-pdDaARWcC--:--:allow
 group@:--:--:allow
  everyone@:--:--:allow

 Now, let's say a legacy application used a requested creation mode of
 0644, and the current umask was 022, and the application calculated
 the resultant mode and explicitly set it with chmod(0644):

why is umask 022 when you want 077?  *that's* your problem.
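
a hedged sketch of the grant-only approach (Solaris chmod ACL syntax;
the user and group are taken from the example above, the path is made
up):

  umask 777
  touch /tank/proj/foo                 # created with mode 000
  # grant access through ACL entries instead of the mode bits
  chmod A+user:henson:read_data/write_data/append_data:allow /tank/proj/foo
  chmod A+group:csupomona:read_data:allow /tank/proj/foo
  ls -V /tank/proj/foo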

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Oops, ran zfs destroy after renaming a folder and deleted my file system.

2010-02-25 Thread Kjetil Torgrim Homme
tomwaters tomwat...@chadmail.com writes:

 I created a zfs file system, cloud/movies and shared it.
 I then filled it with movies and music.
 I then decided to rename it, so I used rename in the Gnome to change
 the folder name to media...ie cloud/media.  MISTAKE
 I then noticed the zfs share was pointing to /cloud/movies which no
 longer exists.

I think you should file a bug against Nautilus (the GNOME file manager).
When you rename a directory, it should check for it being a mountpoint
and warn appropriately.  (adding ZFS specific code to DTRT is perhaps
asking for a bit too much.)  evidently it got an error for the rename(2)
and instead started to copy/delete the original.  *inside* some
filesystems, this is probably correct behaviour, but when the object is
a filesystem, I don't think anyone wants this behaviour.  if they want to
move data off the filesystem, they should go inside, mark all files, and
drag (or ^X ^V) the files wherever they should go.

 So, I removed cloud/movies with zfs destroy --- BIGGER MISTAKE

I see the reasoning behind this, but as you've learnt the hard way:
always double-check before using zfs destroy.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Kjetil Torgrim Homme
Steve steve.jack...@norman.com writes:

 I would like to ask a question regarding ZFS performance overhead when
 having hundreds of millions of files

 We have a storage solution, where one of the datasets has a folder
 containing about 400 million files and folders (very small 1K files)

 What kind of overhead do we get from this kind of thing?

at least 50%.  I don't think this is obvious, so I'll state it: RAID-Z
will not gain you any additional capacity over mirroring in this
scenario.

remember each individual file gets its own stripe.  if the file is 512
bytes or less, you'll need another 512 byte block for the parity
(strictly speaking it is stored as a plain copy: with a single data
block, the XOR parity is identical to the data, so there is nothing to
compute).  what's more, a file between 513 and 1024 bytes gets an
additional padding block, since allocations are rounded up to an even
number of disk blocks to avoid leaving unusable single-block holes of
free space.  a 1536 byte file will also consume 2048 bytes of physical
disk, however.  the reasoning for RAID-Z2 is similar, except that
allocations are rounded up to a multiple of three blocks, so it adds a
padding block even for the 1536 byte file.  to summarise:

  net     raid-z1        raid-z2
  --------------------------------
   512    1024  2x       1536  3x
  1024    2048  2x       3072  3x
  1536    2048  1½x      3072  2x
  2048    3072  1½x      3072  1½x
  2560    3072  1⅕x      4608  1⅘x

the above assumes at least 8 (9) disks in the vdev, otherwise you'll get
a little more overhead for the larger filesizes.

 Our storage performance has degraded over time, and we have been
 looking in different places for cause of problems, but now I am
 wondering if its simply a file pointer issue?

adding new files will fragment directories, that might cause performance
degradation depending on access patterns.

I don't think the number of files in itself will cause problems, but since you
get a lot more ZFS records in your dataset (128x!), more of the disk
space is wasted on block pointers, and you may get more block pointer
writes since more levels are needed.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Kjetil Torgrim Homme
David Dyer-Bennet d...@dd-b.net writes:

 Which is bad enough if you say ls.  And there's no option to say
 don't sort that I know of, either.

/bin/ls -f

using /bin/ls makes sure an alias from ls to ls -F or similar doesn't
cause extra work.  you can also write \ls -f to bypass a potential
alias.

without an argument, GNU ls and SunOS ls behave the same.  if you write
ls -f * you'll only get output for directories in SunOS, while GNU ls
will list all files.

(ls -f has been there since SunOS 4.0 at least)
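
for example (bypassing any alias and skipping the sort; directory name
made up):

  /bin/ls -f /tank/bigdir | wc -l
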
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-23 Thread Kjetil Torgrim Homme
Miles Nordin car...@ivy.net writes:

 kth == Kjetil Torgrim Homme kjeti...@linpro.no writes:

kth the SCSI layer handles the replaying of operations after a
kth reboot or connection failure.

 how?

 I do not think it is handled by SCSI layers, not for SAS nor iSCSI.

sorry, I was inaccurate.  error reporting is done by the SCSI layer, and
the filesystem handles it by retrying whatever outstanding operations it
has.

 Also, remember a write command that goes into the write cache is a
 SCSI command that's succeeded, even though it's not actually on disk
 for sure unless you can complete a sync cache command successfully and
 do so with no errors nor ``protocol events'' in the gap between the
 successful write and the successful sync.  A facility to replay failed
 commands won't help because when a drive with write cache on reboots,
 successful writes are rolled back.

this is true, sorry about my lack of precision.  the SCSI layer can't do
this on its own.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-22 Thread Kjetil Torgrim Homme
Miles Nordin car...@ivy.net writes:

 There will probably be clients that might seem to implicitly make this
 assumption by mishandling the case where an iSCSI target goes away and
 then comes back (but comes back less whatever writes were in its write
 cache).  Handling that case for NFS was complicated, and I bet such
 complexity is just missing without any equivalent from the iSCSI spec,
 but I could be wrong.  I'd love to be educated.

 Even if there is some magical thing in iSCSI to handle it, the magic
 will be rarely used and often wrong until people learn how to test it,
 which they haven't yet, the way they have with NFS.

I decided I needed to read up on this and found RFC 3783 which is very
readable, highly recommended:

  http://tools.ietf.org/html/rfc3783

basically iSCSI just defines a reliable channel for SCSI.  the SCSI
layer handles the replaying of operations after a reboot or connection
failure.  as far as I understand it, anyway.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] improve meta data performance

2010-02-19 Thread Kjetil Torgrim Homme
Chris Banal cba...@gmail.com writes:

 We have a SunFire X4500 running Solaris 10U5 which does about 5-8k nfs
 ops of which about 90% are meta data. In hind sight it would have been
 significantly better  to use a mirrored configuration but we opted for
 4 x (9+2) raidz2 at the time. We can not take the downtime necessary
 to change the zpool configuration.

 We need to improve the meta data performance with little to no
 money. Does anyone have any suggestions?

I believe the latest Solaris update will improve metadata caching.
always good to be up-to-date on patches, no?

 Is there such a thing as a Sun supported NVRAM PCI-X card compatible
 with the X4500 which can be used as an L2ARC?

I think they only have PCIe, and it hardly qualifies as little to no
money.

  http://www.sun.com/storage/disk_systems/sss/f20/specs.xml

I'll second the recommendations for Intel X25-M for L2ARC if you can
spare a SATA slot for it.
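
adding one as a cache device is a one-liner (sketch, device name made
up):

  zpool add tank cache c5t0d0
  zpool iostat -v tank    # the cache device is listed below the pool vdevs
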
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] ZFS slowness under domU high load

2010-02-14 Thread Kjetil Torgrim Homme
Bogdan Ćulibrk b...@default.rs writes:

 What are my options from here? To move onto zvol with greater
 blocksize? 64k? 128k? Or I will get into another trouble going that
 way when I have small reads coming from domU (ext3 with default
 blocksize of 4k)?

yes, definitely.  have you considered using NFS rather than zvols for
the data filesystems?  (keep zvol for the domU software.)
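
if you do keep data on zvols, note that volblocksize can only be set at
creation time, so changing it means creating a new zvol and copying the
data over (sketch, name and size made up):

  zfs create -V 20G -o volblocksize=64K tank/domu-data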

it's strange that you see so much write activity during backup -- I'd
expect that to do just reads...  what's going on at the domU?

generally, the best way to improve performance is to add RAM for ARC
(512 MiB is *very* little IMHO) and SSD for your ZIL, but it does seem
to be a poor match for your concept of many small low-cost dom0's.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-10 Thread Kjetil Torgrim Homme
Eric D. Mudama edmud...@bounceswoosh.org writes:
 On Tue, Feb  9 at  2:36, Kjetil Torgrim Homme wrote:
 no one is selling disk brackets without disks.  not Dell, not EMC,
 not NetApp, not IBM, not HP, not Fujitsu, ...

 http://discountechnology.com/Products/SCSI-Hard-Drive-Caddies-Trays

very nice, thanks.  unfortunately it probably won't last:

[http://lists.us.dell.com/pipermail/linux-poweredge/2010-February/041335.html]:
|
| In the case of Dell's PERC RAID controllers, we began informing
| customers when a non-Dell drive was detected with the introduction of
| PERC5 RAID controllers in early 2006. With the introduction of the
| PERC H700/H800 controllers, we began enabling only the use of Dell
| qualified drives.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-10 Thread Kjetil Torgrim Homme
Bob Friesenhahn bfrie...@simple.dallas.tx.us writes:
 On Wed, 10 Feb 2010, Frank Cusack wrote:

 The other three commonly mentioned issues are:

  - Disable the Nagle algorithm on the Windows clients.

for iSCSI?  shouldn't be necessary.

  - Set the volume block size so that it matches the client filesystem
block size (default is 128K!).

default for a zvol is 8 KiB.
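
you can check what a given zvol actually uses (zvol name made up):

  zfs get volblocksize tank/somevol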

  - Check for an abnormally slow disk drive using 'iostat -xe'.

his problem is lazy ZFS: notice how it gathers up data for 15 seconds
before flushing it to disk.  tweaking the flush interval down might
help.
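
a hedged sketch of that tweak; the tunable has had different names
between releases (txg_time on older builds, zfs_txg_timeout on newer
ones), so check yours before putting something like this in /etc/system:

  set zfs:zfs_txg_timeout = 5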

 An iostat -xndz 1 readout of the %b column during a file copy to
 the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2
 seconds of 100, and repeats.

what are the other values?  ie., number of ops and actual amount of data
read/written.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-10 Thread Kjetil Torgrim Homme
[please don't top-post, please remove CC's, please trim quotes.  it's
 really tedious to clean up your post to make it readable.]

Marc Nicholas geekyth...@gmail.com writes:
 Brent Jones br...@servuhome.net wrote:
 Marc Nicholas geekyth...@gmail.com wrote:
 Kjetil Torgrim Homme kjeti...@linpro.no wrote:
 his problem is lazy ZFS, notice how it gathers up data for 15
 seconds before flushing the data to disk.  tweaking the flush
 interval down might help.

 How does lowering the flush interval help? If he can't ingress data
 fast enough, faster flushing is a Bad Thing(tm).

if network traffic is blocked during the flush, you can experience
back-off on both the TCP and iSCSI level.

 what are the other values?  ie., number of ops and actual amount of
 data read/written.

this remained unanswered.

 ZIL performance issues? Is writecache enabled on the LUNs?
 This is a Windows box, not a DB that flushes every write.

have you checked if the iSCSI traffic is synchronous or not?  I don't
use Windows, but other reports on the list have indicated that at least
the NTFS format operation *is* synchronous.  use zilstat to see.

 The drives are capable of over 2000 IOPS (albeit with high latency as
 its NCQ that gets you there) which would mean, even with sync flushes,
 8-9MB/sec.

2000 IOPS is the aggregate, but the disks are set up as *one* RAID-Z2!
NCQ doesn't help much, since the write operations issued by ZFS are
already ordered correctly.

the OP may also want to try tweaking metaslab_df_free_pct, this helped
linear write performance on our Linux clients a lot:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6869229
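
a sketch of that tweak (verify the variable exists in your build first);
in /etc/system:

  set zfs:metaslab_df_free_pct = 4

or on a running system:

  echo 'metaslab_df_free_pct/W0t4' | mdb -kw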

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?

2010-02-09 Thread Kjetil Torgrim Homme
Richard Elling richard.ell...@gmail.com writes:

 On Feb 8, 2010, at 9:10 PM, Damon Atkins wrote:

 I would have thought that if I write 1k then ZFS txg times out in
 30secs, then the 1k will be written to disk in a 1k record block, and
 then if I write 4k then 30secs later another txg happens and a 4k
 record size block will be written, and then if I write 130k a 128k
 and 2k record block will be written.
 
 Making the file have record sizes of
 1k+4k+128k+2k

 Close. Once the max record size is achieved, it is not reduced.  So
 the allocation is: 1KB + 4KB + 128KB + 128KB

I think the above is easily misunderstood.  I assume the OP means
append, not rewrites, and in that case (with recordsize=128k):

* after the first write, the file will consist of a single 1 KiB record.
* after the first append, the file will consist of a single 5 KiB
  record.
* after the second append, one 128 KiB record and one 7 KiB record.

in each of these operations, the *whole* file will be rewritten to a new
location, but after a third append, only the tail record will be
rewritten.
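
you can watch this happen with zdb if you're curious (sketch; pool name,
path and object number are made up):

  echo foo > /tank/fs/smallfile
  ls -i /tank/fs/smallfile    # the inode number is the object number, say 8
  sync
  zdb -ddddd tank/fs 8        # dumps the block layout for object 8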

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?

2010-02-09 Thread Kjetil Torgrim Homme
Neil Perrin neil.per...@sun.com writes:

 On 02/09/10 08:18, Kjetil Torgrim Homme wrote:
 I think the above is easily misunderstood.  I assume the OP means
 append, not rewrites, and in that case (with recordsize=128k):

 * after the first write, the file will consist of a single 1 KiB record.
 * after the first append, the file will consist of a single 5 KiB
   record.

 Good so far.

 * after the second append, one 128 KiB record and one 7 KiB record.

 A long time ago we used to write short tail blocks, but not any more.
 So after the 2nd append we actually have 2 128KB blocks.

thanks a lot for the correction!

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-08 Thread Kjetil Torgrim Homme
Daniel Carosone d...@geek.com.au writes:

 In that context, I haven't seen an answer, just a conclusion: 

  - All else is not equal, so I give my money to some other hardware
manufacturer, and get frustrated that Sun won't let me buy the
parts I could use effectively and comfortably.  

no one is selling disk brackets without disks.  not Dell, not EMC, not
NetApp, not IBM, not HP, not Fujitsu, ...

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Intrusion Detection - powered by ZFS Checksumming ?

2010-02-08 Thread Kjetil Torgrim Homme
Damon Atkins damon_atk...@yahoo.com.au writes:

 One problem could be block sizes, if a file is re-written and is the
 same size it may have different ZFS record sizes within, if it was
 written over a long period of time (txg's)(ignoring compression), and
 therefore you could not use ZFS checksum to compare two files.

the record size used for a file is chosen when that file is created.  it
can't change.  when the default record size for the dataset changes,
only new files will be affected.  ZFS *must* write a complete record
even if you change just one byte (unless it's the tail record of
course), since there isn't any better granularity for the block
pointers.
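
a small sketch (dataset name made up); note that changing the property
only affects files created afterwards:

  zfs set recordsize=16K tank/data
  zfs get recordsize tank/data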

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] [OT] excess zfs-discuss mailman digests

2010-02-08 Thread Kjetil Torgrim Homme
grarpamp grarp...@gmail.com writes:

 PS: Is there any way to get a copy of the list since inception for
 local client perusal, not via some online web interface?

I prefer to read mailing lists using a newsreader and the NNTP interface
at Gmane.  a newsreader tends to be better at threading etc. than a mail
client which is fed an mbox...  see http://gmane.org/about.php for more
information.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-07 Thread Kjetil Torgrim Homme
Tim Cook t...@cook.ms writes:
 Kjetil Torgrim Homme kjeti...@linpro.no wrote:
I don't know what the J4500 drive sled contains, but for the J4200
and J4400 they need to include quite a bit of circuitry to handle
SAS protocol all the way, for multipathing and to be able to
accept a mix of SAS and SATA drives.  it's not just a piece of
sheet metal, some plastic and a LED.

the pricing does look strange, and I think it would be better to
raise the price of the enclosure (which is silly cheap when empty
IMHO) and reduce the drive prices somewhat.  but that's just
psychology, and doesn't really matter for total cost.

 Why exactly would that be better?

people are looking at the price list and seeing that the J4200 costs
22550 NOK [1], while one sixpack of 2TB SATA disks to go with it costs
82500 NOK.  on the other hand you could get six 2TB SATA disks
(Ultrastars) from your friendly neighbourhood shop for 14370 NOK (7700
NOK for six Deskstars).  and to add insult to injury, the neighbourhood
shop offers five years warranty (required by Norwegian consumer laws),
compared to Sun's three years...

everyone knows the price of a harddisk since they buy them for their
home computers.  do they know the price of a disk storage array?  not so
well.  yes, it's a matter of perception for the buyer, but perception
can matter.

 Then it's a high cost of entry.   What if an SMB only needs 6 drives
 day one?  Why charge them an arm and a leg for the enclosure, and
 nothing for the drives?  Again, the idea is that you're charging based
 on capacity.

see my numbers above.  the chassis itself is just 12% of the cost (22%
when half full).  some middle ground could be found.

anyway, we're buying these systems and are very happy with them.  when
disks fail, Sun replaces them very expeditiously and with a minimum of
fuss.


[1] all prices include VAT to simplify comparison.  prices are current
from shop.sun.com and komplett.no.  Sun list prices are subject to
haggling.
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-07 Thread Kjetil Torgrim Homme
Christo Kutrovsky kutrov...@pythian.com writes:

 Has anyone seen soft corruption in NTFS iSCSI ZVOLs after a power
 loss?

this is not from experience, but I'll answer anyway.

 I mean, there is no guarantee writes will be executed in order, so in
 theory, one could corrupt it's NTFS file system.

I think you have that guarantee, actually.

the problem is that the Windows client will think that block N has been
updated, since the iSCSI server told it it was commited to stable
storage.  however, when ZIL is disabled, that update may get lost during
power loss.  if block N contains, say, directory information, this could
cause weird behaviour.  it may look fine at first -- the problem won't
appear until NTFS has thrown block N out of its cache and it needs to
re-read it from the server.  when the re-read stale data is combined
with fresh data from RAM, it's panic time...

 Would best practice be to rollback the last snapshot before making
 those iSCSI available again?

I think you need to reboot the client so that its RAM cache is cleared
before any other writes are made.

a rollback shouldn't be necessary.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] 3ware 9650 SE

2010-02-06 Thread Kjetil Torgrim Homme
Alexandre MOREL almo...@gmail.com writes:

 It's a few day now that I try to use a 9650SE 3ware controller to work
 on opensolaris and I found the following problem : the tw driver seems
 work, I can see my controller whith the tw_cli of 3ware. I can see
 that 2 drives are created with the controller, but when I try to use
 pfexec format, it doesn't detect the drive.

did you create logical devices using tw_cli?  a pity none of these cards
seem to support proper JBOD mode.  in 9650's case it's especially bad,
since pulling a drive will *renumber* the logical devices without
notifying the OS.  quite scary!

if you've created the devices, try running devfsadm once more.
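
something along these lines (illustrative; the exact tw_cli syntax for
creating units depends on the firmware, so check its built-in help):

  devfsadm -Cv      # clean up stale device links and create new ones
  format            # the logical units should be visible here now
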
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-06 Thread Kjetil Torgrim Homme
matthew patton patto...@yahoo.com writes:

 true. but I buy a Ferrari for the engine and bodywork and chassis
 engineering. It is totally criminal what Sun/EMC/Dell/Netapp do
 charging customers 10x the open-market rate for standard drives. A
 RE3/4 or NS drive is the same damn thing no matter if I buy it from
 ebay or my local distributor. Dell/Sun/Netapp buy drives by the
 container load. Oh sure, I don't mind paying an extra couple
 pennies/GB for all the strenuous efforts the vendors spend on firmware
 verification (HA!).

I don't know what the J4500 drive sled contains, but for the J4200 and
J4400 they need to include quite a bit of circuitry to handle SAS
protocol all the way, for multipathing and to be able to accept a mix of
SAS and SATA drives.  it's not just a piece of sheet metal, some plastic
and a LED.

the pricing does look strange, and I think it would be better to raise
the price of the enclosure (which is silly cheap when empty IMHO) and
reduce the drive prices somewhat.  but that's just psychology, and
doesn't really matter for total cost.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] How to get a list of changed files between two snapshots?

2010-02-06 Thread Kjetil Torgrim Homme
Frank Cusack frank+lists/z...@linetwo.net writes:

 On 2/4/10 8:00 AM +0100 Tomas Ögren wrote:
 The find -newer blah suggested in other posts won't catch newer
 files with an old timestamp (which could happen for various reasons,
 like being copied with kept timestamps from somewhere else).

 good point.  that is definitely a restriction with find -newer.  but
 if you meet that restriction, and don't need to find added or deleted
 files, it will be faster since only 1 directory tree has to be walked.

FWIW, GNU find has -cnewer
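
for example (sketch; GNU find is often installed as gfind on Solaris,
and the paths are made up):

  touch -t 201002040800 /tmp/ref
  gfind /tank/fs -type f -cnewer /tmp/ref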

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] list new files/activity monitor

2010-02-06 Thread Kjetil Torgrim Homme
Nilsen, Vidar vidar.nil...@palantir.no writes:

 And once an hour I run a script that checks for new dirs last 60
 minutes matching some criteria, and outputs the path to an
 IRC-channel. Where we can see if someone else has added new stuff.

 Method used is “find –mmin -60”, which gets horrible slow when more
 data is added.

 My question is if there exists some method I can get the same results
 but based on events rather than seeking through everything once an
 hour.

yes, File Events Notification (FEN)

  http://blogs.sun.com/praks/entry/file_events_notification

you access this through the event port API.

  http://developers.sun.com/solaris/articles/event_completion.html

gnome-vfs uses FEN, but unfortunately gnomevfs-monitor will only watch a
specific directory.  I think you'll need to write your own code to watch
all directories in a tree.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] 3ware 9650 SE

2010-02-01 Thread Kjetil Torgrim Homme
Tiernan O'Toole lsmart...@gmail.com writes:

 looking at the 3ware 9650 SE raid controller for a new build... anyone
 have any luck with this card? their site says they support
 OpenSolaris... anyone used one?

didn't work too well for me.  it's fast and nice for a couple of days,
then the driver gets slower and slower, and eventually it gets stuck and
all I/O freezes.  preventive reboots were needed.  I used the newest
driver from 3ware/AMCC with 2008.11 and 2009.05.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] ZFS compressed ration inconsistency

2010-02-01 Thread Kjetil Torgrim Homme
antst ant.stari...@gmail.com writes:

 I'm more than happy by the fact that data consumes even less physical
 space on storage.  But I want to understand why and how. And want to
 know to what numbers I can trust.

my guess is sparse files.

BTW, I think you should compare the size returned by du -bx with
refer, not used.  in this case it's not snapshots that make the
difference, though.
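
something like this (sketch; GNU du is often installed as gdu on
Solaris, dataset name made up):

  gdu -sbx /tank/data          # apparent size of the files, one filesystem
  zfs get used,referenced,compressratio tank/data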

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?

2010-01-31 Thread Kjetil Torgrim Homme
Mark Bennett mark.benn...@public.co.nz writes:

 Update:

 For the WD10EARS, the blocks appear to be aligned on the 4k boundary
 when zfs uses the whole disk (whole disk as EFI partition).

 Part      Tag    Flag    First Sector        Size     Last Sector
   0       usr     wm              256     931.51Gb     1953508750

 calc: 256 * 512 / 4096 = 32, so the partition starts on a 4 KiB boundary

I'm afraid this isn't enough.  if you enable compression, any ZFS write
can be unaligned.  also, if you're using raid-z with an odd number of
data disks, some of (most of?) your stripes will be misaligned.

ZFS needs to use 4096 octets as the basic block to fully exploit
the performance of these disks.
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Building big cheap storage system. What hardware to use?

2010-01-28 Thread Kjetil Torgrim Homme
Freddie Cash fjwc...@gmail.com writes:

 We use the following for our storage servers:
 [...]
 3Ware 9650SE PCIe RAID controller (12-port, muli-lane)
 [...]
 Fully supported by FreeBSD, so everything should work with
 OpenSolaris.

FWIW, I've used the 9650SE with 16 ports in OpenSolaris 2008.11 and
2009.06, and had problems with the driver just hanging after 4-5 days of
use.  iostat would report 100% busy on all drives connected to the card,
and even uadmin 1 1 (low-level reboot command) was ineffective.  I had
to break into the debugger and do the reboot from there.  I was using
the newest driver from AMCC.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] zero out block / sectors

2010-01-25 Thread Kjetil Torgrim Homme
Mike Gerdts mger...@gmail.com writes:

 Kjetil Torgrim Homme wrote:
 Mike Gerdts mger...@gmail.com writes:

 John Hoogerdijk wrote:
 Is there a way to zero out unused blocks in a pool?  I'm looking for
 ways to shrink the size of an opensolaris virtualbox VM and using the
 compact subcommand will remove zero'd sectors.

 I've long suspected that you should be able to just use mkfile or dd
 if=/dev/zero ... to create a file that consumes most of the free
 space then delete that file.  Certainly it is not an ideal solution,
 but seems quite likely to be effective.

 you'll need to (temporarily) enable compression for this to have an
 effect, AFAIK.

 (dedup will obviously work, too, if you dare try it.)

 You are missing the point.  Compression and dedup will make it so that
 the blocks in the devices are not overwritten with zeroes.  The goal
 is to overwrite the blocks so that a back-end storage device or
 back-end virtualization platform can recognize that the blocks are not
 in use and as such can reclaim the space.

aha, I was assuming the OP's VirtualBox image was stored on ZFS, but I
realise now that it's the other way around -- he's running ZFS inside a
VirtualBox image hosted on a traditional filesystem.  in that case
you're right, and I'm wrong :-)
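
for the record, the zero-fill trick inside the guest looks something
like this (sketch; pool and paths made up; compression and dedup must be
off for the zeros to actually reach the virtual disk, and don't leave
the pool completely full for long):

  dd if=/dev/zero of=/tank/fs/zerofill bs=1024k   # runs until space runs out
  sync
  rm /tank/fs/zerofill
  # then compact the image from the host, e.g. VBoxManage modifyhd --compact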

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



[zfs-discuss] optimise away COW when rewriting the same data?

2010-01-24 Thread Kjetil Torgrim Homme
I was looking at the performance of using rsync to copy some large files
which change only a little between each run (database files).  I take a
snapshot after every successful run of rsync, so when using rsync
--inplace, only changed portions of the file will occupy new disk space.
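
the cycle looks roughly like this (host and dataset names made up):

  rsync -a --inplace dbhost:/var/db/files/ /tank/backup/db/
  zfs snapshot tank/backup/db@$(date +%Y-%m-%d)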

Unfortunately, performance wasn't too good, the source server in
question simply didn't have much CPU to perform the rsync delta
algorithm, and in addition it creates read I/O load on the destination
server.  So I had to switch it off and transfer the whole file instead.
In this particular case, that means I need 120 GB to store each run
rather than 10, but that's the way it goes.

If I had enabled deduplication, this would be a moot point, dedup would
take care of it for me.  Judging from early reports my server
will probably not have the required oomph to handle it, so I'm holding
off until I get to replace it with a server with more RAM and CPU.

But it occured to me that this is a special case which could be
beneficial in many cases -- if the filesystem uses secure checksums, it
could check the existing block pointer and see if the replaced data
matches.  (Due to the (infinitesimal) potential for hash collisions this
should be configurable the same way it is for dedup.)  In essence,
rsync's writes would become no-ops, and very little CPU would be wasted
on either side of the pipe.

Even in the absence of snapshots, this would leave the filesystem less
fragmented, since the COW is avoided.  This would be a win-win if the
ZFS pipeline can communicate the correct information between layers.

Are there any ZFS hackers who can comment on the feasibility of this
idea?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] Best 1.5TB drives for consumer RAID?

2010-01-24 Thread Kjetil Torgrim Homme
Tim Cook t...@cook.ms writes:

 On Sat, Jan 23, 2010 at 7:57 PM, Frank Cusack fcus...@fcusack.com wrote:

 I mean, just do a triple mirror of the 1.5TB drives rather than
 say (6) .5TB drives in a raidz3.

 I bet you'll get the same performance out of 3x1.5TB drives you get
 out of 6x500GB drives too.

no, it will be much better.  you get 3 independent disks available for
reading, so 3x the IOPS.  in a 6x500 GB setup all disks will need to
operate in tandem, for both reading and writing.  even if the larger
disks are slower than small disks, the difference is not even close to
such a factor.  perhaps 20% fewer IOPS?

 Are you really trying to argue people should never buy anything but
 the largest drives available?

I don't think that's Frank's point.  the key here is the advantages of a
(triple) mirroring over RAID-Z.  it just so happens that it makes
economic sense.  (think about savings in power, too.)
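
a sketch of the two layouts being compared (device names made up; both
end up with 1.5 TB of usable space):

  # 3-way mirror of 1.5 TB drives:
  zpool create tank mirror c0t0d0 c0t1d0 c0t2d0
  # six 0.5 TB drives in raidz3:
  zpool create tank raidz3 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0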

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] optimise away COW when rewriting the same data?

2010-01-24 Thread Kjetil Torgrim Homme
David Magda dma...@ee.ryerson.ca writes:

 On Jan 24, 2010, at 10:26, Kjetil Torgrim Homme wrote:

 But it occured to me that this is a special case which could be
 beneficial in many cases -- if the filesystem uses secure checksums,
 it could check the existing block pointer and see if the replaced
 data matches.  [...]

 Are there any ZFS hackers who can comment on the feasibility of this
 idea?

 There is a bug that requests an API in ZFS' DMU library to get
 checksum data:

   6856024 - DMU checksum API
   http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6856024

That would work, but it would require rsync to do checksum calculations
itself to do the comparison.  Then ZFS would recalculate the checksum if
the data was actually written, so it's wasting work for local copies.
It would be interesting to extend the rsync protocol to take advantage
of this, though, so that the checksum can be calculated on the remote
host.  Hmm...  It would need very ZFS-specific support, e.g., the
recordsize is potentially different for each file, likewise for the
checksum algorithm.

Fixing a library seems easier than patching the kernel, so your approach
is probably better anyhow.

 It specifically mentions Lustre, and not anything like the ZFS POSIX
 interface to files (ZPL). There's also:

 Here's another: file comparison based on values derived from files'
 checksum or dnode block pointer. This would allow for very efficient
 file comparison between filesystems related by cloning. Such values
 might be made available through an extended attribute, say.

 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6366224

 It's been brought up before on zfs-discuss: the two options would be
 linking against some kind of ZFS-specific library, or using an ioctl()
 of some kind. As it stands, ZFS is really the only mainstream(-ish)
 file system that does checksums, and so there's no standard POSIX call
 for such things. Perhaps as more file systems add this functionality
 something will come of it.

The checksum algorithms need to be very strictly specified.  Not a
problem for sha256, I guess, but fletcher4 probably doesn't have
independent implementations which are 100% compatible with ZFS -- and
GPL-licensed (needed for rsync and many other applications).

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-11 Thread Kjetil Torgrim Homme
Lutz Schumann presa...@storageconcepts.de writes:

 Actually the performance decrease when disableing the write cache on
 the SSD is aprox 3x (aka 66%).

for this reason, you want a controller with battery backed write cache.
in practice this means a RAID controller, even if you don't use the RAID
functionality.  of course you can buy SSDs with capacitors, too, but I
think that will be more expensive, and it will restrict your choice of
model severely.

(BTW, thank you for testing forceful removal of power.  the result is as
expected, but it's good to see that theory and practice match.)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] raidz stripe size (not stripe width)

2010-01-05 Thread Kjetil Torgrim Homme
Brad bene...@yahoo.com writes:

 Hi Adam,

I'm not Adam, but I'll take a stab at it anyway.

BTW, your crossposting is a bit confusing to follow, at least when using
gmane.org.  I think it is better to stick to one mailing list anyway?

 From your the picture, it looks like the data is distributed evenly
 (with the exception of parity) across each spindle then wrapping
 around again (final 4K) - is this one single write operation or two?

it is a single write operation per device.  actually, it may be less
than one write operation since the transaction group, which probably
contains many more updates, is written as a whole.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-18 Thread Kjetil Torgrim Homme
Darren J Moffat darr...@opensolaris.org writes:

 Kjetil Torgrim Homme wrote:

 I don't know how tightly interwoven the dedup hash tree and the block
 pointer hash tree are, or if it is all possible to disentangle them.

 At the moment I'd say very interwoven by design.

 conceptually it doesn't seem impossible, but that's easy for me to
 say, with no knowledge of the zio pipeline...

 Correct it isn't impossible but instead there would probably need to
 be two checksums held, one of the untransformed data (ie uncompressed
 and unencrypted) and one of the transformed data (compressed and
 encrypted). That has different tradeoffs and SHA256 can be expensive
 too see:

 http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via

great work!  SHA256 is more expensive than I thought, even with
misc/sha2 it takes 1 ms per 128 KiB?  that's roughly the same CPU usage
as lzjb!  in that case hashing the (smaller) compressed data is more
efficient than doing an additional hash of the full uncompressed block.

it's interesting to note that 64 KiB looks faster (a bit hard to read
the chart accurately), L1 cache size coming into play, perhaps?
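
a crude userland comparison, for what it's worth (both utilities ship
with Solaris):

  mkfile 128k /tmp/blk
  ptime digest -a sha256 /tmp/blk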

 Note also that the compress/encrypt/checksum and the dedup are
 separate pipeline stages so while dedup is happening for block N block
 N+1 can be getting transformed - so this is designed to take advantage
 of multiple scheduling units (threads,cpus,cores etc).

nice.  are all of them separate stages, or are compress/encrypt/checksum
done as one stage?

 oh, how does encryption play into this?  just don't?  knowing that
 someone else has the same block as you is leaking information, but that
 may be acceptable -- just make different pools for people you don't
 trust.

 compress, encrypt, checksum, dedup.

 You are correct that it is an information leak but only within a
 dataset and its clones and only if you can observe the deduplication
 stats (and you need to use zdb to get enough info to see the leak -
 and that means you have access to the raw devices), the deupratio
 isn't really enough unless the pool is really idle or has only one
 user writing at a time.

 For the encryption case deduplication of the same plaintext block will
 only work within a dataset or a clone of it - because only in those
 cases do you have the same key (and the way I have implemented the IV
 generation for AES CCM/GCM mode ensures that the same plaintext will
 have the same IV so the ciphertexts will match).

makes sense.

 Also if you place a block in an unencrypted dataset that happens to
 match the ciphertext in an encrypted dataset they won't dedup either
 (you need to understand what I've done with the AES CCM/GCM MAC and
 the zio_chksum_t field in the blkptr_t and how that is used by dedup
 to see why).

wow, I didn't think of that problem.  did you get bitten by wrongful
dedup during testing with image files? :-)

 If that small information leak isn't acceptable even within the
 dataset then don't enable both encryption and deduplication on those
 datasets - and don't delegate that property to your users either.  Or
 you can frequently rekey your per dataset data encryption keys 'zfs
 key -K' but then you might as well turn dedup off - other there are
 some very good usecases in multi level security where doing
 dedup/encryption and rekey provides a nice effect.

indeed.  ZFS is extremely flexible.

thank you for your response, it was very enlightening.
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] dedup existing data

2009-12-18 Thread Kjetil Torgrim Homme
Anil an...@entic.net writes:

 If you have another partition with enough space, you could technically
 just do:

 mv src /some/other/place
 mv /some/other/place src

 Anyone see a problem with that? Might be the best way to get it
 de-duped.

I get uneasy whenever I see mv(1) used to move directory trees between
filesystems, that is, whenever mv(1) can't do a simple rename(2), but
has to do a recursive copy of files.  it is essentially not restartable:
if mv(1) is interrupted, you must clean up the mess with rsync or
similar tools.  so why not use rsync from the get go?  (or zfs send/recv
of course.)
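
a sketch of a restartable equivalent for each leg of the move (needs
rsync 2.6.9 or newer; -H preserves hardlinks, and --remove-source-files
deletes each file once it has been transferred):

  rsync -aH --remove-source-files src/ /some/other/place/
  # empty source directories are left behind and must be removed separately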

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-17 Thread Kjetil Torgrim Homme
Andrey Kuzmin andrey.v.kuz...@gmail.com writes:

 Downside you have described happens only when the same checksum is
 used for data protection and duplicate detection. This implies sha256,
 BTW, since fletcher-based dedupe has been dropped in recent builds.

if the hash used for dedup is completely separate from the hash used for
data protection, I don't see any downsides to computing the dedup hash
from uncompressed data.  why isn't it?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-17 Thread Kjetil Torgrim Homme
Darren J Moffat darr...@opensolaris.org writes:
 Kjetil Torgrim Homme wrote:
 Andrey Kuzmin andrey.v.kuz...@gmail.com writes:

 Downside you have described happens only when the same checksum is
 used for data protection and duplicate detection. This implies sha256,
 BTW, since fletcher-based dedupe has been dropped in recent builds.

 if the hash used for dedup is completely separate from the hash used
 for data protection, I don't see any downsides to computing the dedup
 hash from uncompressed data.  why isn't it?

 It isn't separate because that isn't how Jeff and Bill designed it.

thanks for confirming that, Darren.

 I think the design the have is great.

I don't disagree.

 Instead of trying to pick holes in the theory can you demonstrate a
 real performance problem with compression=on and dedup=on and show
 that it is because of the compression step ?

compression requires CPU, actually quite a lot of it.  even with the
lean and mean lzjb, you will get not much more than 150 MB/s per core or
something like that.  so, if you're copying a 10 GB image file, it will
take a minute or two, just to compress the data so that the hash can be
computed so that the duplicate block can be identified.  if the dedup
hash was based on uncompressed data, the copy would be limited by
hashing efficiency (and dedup tree lookup).

I don't know how tightly interwoven the dedup hash tree and the block
pointer hash tree are, or if it is all possible to disentangle them.

conceptually it doesn't seem impossible, but that's easy for me to
say, with no knowledge of the zio pipeline...

oh, how does encryption play into this?  just don't?  knowing that
someone else has the same block as you is leaking information, but that
may be acceptable -- just make different pools for people you don't
trust.

 Otherwise if you want it changed code it up and show how what you have
 done is better in all cases.

I wish I could :-)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-16 Thread Kjetil Torgrim Homme
Andrey Kuzmin andrey.v.kuz...@gmail.com writes:

 Kjetil Torgrim Homme wrote:
 for some reason I, like Steve, thought the checksum was calculated on
 the uncompressed data, but a look in the source confirms you're right,
 of course.

 thinking about the consequences of changing it, RAID-Z recovery would be
 much more CPU intensive if hashing was done on uncompressed data --

 I don't quite see how dedupe (based on sha256) and parity (based on
 crc32) are related.

I tried to hint at an explanation:

 every possible combination of the N-1 disks would have to be
 decompressed (and most combinations would fail), and *then* the
 remaining candidates would be hashed to see if the data is correct.

the key is that you don't know which block is corrupt.  if everything is
hunky-dory, the parity will match the data.  parity in RAID-Z1 is not a
checksum like CRC32, it is simply XOR (like in RAID 5).  here's an
example with four data disks and one parity disk:

  D1  D2  D3  D4  PP
  00  01  10  10  01

this is a single stripe with 2-bit disk blocks for simplicity.  if you
XOR together all the blocks, you get 00.  that's the simple premise for
reconstruction -- D1 = XOR(D2, D3, D4, PP), D2 = XOR(D1, D3, D4, PP) and
so on.

so what happens if a bit flips in D4 and it becomes 00?  the total XOR
isn't 00 anymore, it is 10 -- something is wrong.  but unless you get a
hardware signal from D4, you don't know which block is corrupt.  this is
a major problem with RAID 5, the data is irrevocably corrupt.  the
parity discovers the error, and can alert the user, but that's the best
it can do.  in RAID-Z the hash saves the day: first *assume* D1 is bad
and reconstruct it from parity.  if the hash for the block is OK, D1
*was* bad.  otherwise, assume D2 is bad.  and so on.

so, the parity calculation will indicate which stripes contain bad
blocks.  but the hash (the sanity check that tells you which disk blocks
are actually bad) must be calculated over all the stripes a ZFS block
(record) consists of.

 this would be done on a per recordsize basis, not per stripe, which
 means reconstruction would fail if two disk blocks (512 octets) on
 different disks and in different stripes go bad.  (doing an exhaustive
 search for all possible permutations to handle that case doesn't seem
 realistic.)

actually this is the same for compression before/after hashing.  it's
just that each permutation is more expensive to check.

 in addition, hashing becomes slightly more expensive since more data
 needs to be hashed.

 overall, my guess is that this choice (made before dedup!) will give
 worse performance in normal situations in the future, when dedup+lzjb
 will be very common, at a cost of faster and more reliable resilver.  in
 any case, there is not much to be done about it now.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-16 Thread Kjetil Torgrim Homme
Andrey Kuzmin andrey.v.kuz...@gmail.com writes:
 Yet again, I don't see how RAID-Z reconstruction is related to the
 subject discussed (what data should be sha256'ed when both dedupe and
 compression are enabled, raw or compressed ). sha256 has nothing to do
 with bad block detection (may be it will when encryption is
 implemented, but for now sha256 is used for duplicate candidates
 look-up only).

how do you think RAID-Z resilvering works?  please correct me where I'm
wrong.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-16 Thread Kjetil Torgrim Homme
Andrey Kuzmin andrey.v.kuz...@gmail.com writes:
 Darren J Moffat wrote:
 Andrey Kuzmin wrote:
 Resilvering has noting to do with sha256: one could resilver long
 before dedupe was introduced in zfs.

 SHA256 isn't just used for dedup it is available as one of the
 checksum algorithms right back to pool version 1 that integrated in
 build 27.

 'One of' is the key word. And thanks for code pointers, I'll take a
 look.

I didn't mention sha256 at all :-).  the reasoning is the same no matter
what hash algorithm you're using (fletcher2, fletcher4 or sha256).  dedup
doesn't require sha256 either; you can use fletcher4.

the question was: why does data have to be compressed before it can be
recognised as a duplicate?  it does seem like a waste of CPU, no?  I
attempted to show the downsides of identifying blocks by their
uncompressed hash.  (BTW, it doesn't affect storage efficiency; the same
duplicate blocks will be discovered either way.)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-15 Thread Kjetil Torgrim Homme
Robert Milkowski mi...@task.gda.pl writes:
 On 13/12/2009 20:51, Steve Radich, BitShop, Inc. wrote:
 Because if you can de-dup anyway why bother to compress THEN check?
 This SEEMS to be the behaviour - i.e. I would suspect many of the
 files I'm writing are dups - however I see high cpu use even though
 on some of the copies I see almost no disk writes.

 First, the checksum is calculated after compression happens.

for some reason I, like Steve, thought the checksum was calculated on
the uncompressed data, but a look in the source confirms you're right,
of course.

thinking about the consequences of changing it, RAID-Z recovery would be
much more CPU intensive if hashing was done on uncompressed data --
every possible combination of the N-1 disks would have to be
decompressed (and most combinations would fail), and *then* the
remaining candidates would be hashed to see if the data is correct.

this would be done on a per recordsize basis, not per stripe, which
means reconstruction would fail if two disk blocks (512 octets) on
different disks and in different stripes go bad.  (doing an exhaustive
search for all possible permutations to handle that case doesn't seem
realistic.)

in addition, hashing becomes slightly more expensive since more data
needs to be hashed.

overall, my guess is that this choice (made before dedup!) will give
worse performance in normal situations in the future, when dedup+lzjb
will be very common, in exchange for faster and more reliable resilver.  in
any case, there is not much to be done about it now.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] will deduplication know about old blocks?

2009-12-09 Thread Kjetil Torgrim Homme
I'm planning to try out deduplication in the near future, but started
wondering if I can prepare for it on my servers.  one thing which struck
me was that I should change the checksum algorithm to sha256 as soon as
possible.  but I wonder -- is that sufficient?  will the dedup code know
about old blocks when I store new data?

let's say I have an existing file img0.jpg.  I turn on dedup, and copy
it twice, to img0a.jpg and img0b.jpg.  will all three files refer to the
same block(s), or will only img0a and img0b share blocks?
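
for concreteness, the scenario in commands (the dataset name is made up;
zdb -DD dumps dedup table statistics):

  zfs set checksum=sha256 tank/photos   # done now, in advance
  # ... later, once the pool has the dedup bits:
  zfs set dedup=on tank/photos
  cp img0.jpg img0a.jpg
  cp img0.jpg img0b.jpg
  zdb -DD tank                          # how many blocks ended up shared?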

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] will deduplication know about old blocks?

2009-12-09 Thread Kjetil Torgrim Homme
Adam Leventhal a...@eng.sun.com writes:
 Unfortunately, dedup will only apply to data written after the setting
 is enabled. That also means that new blocks cannot dedup against old
 block regardless of how they were written. There is therefore no way
 to prepare your pool for dedup -- you just have to enable it when
 you have the new bits.

thank you for the clarification!
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Accidentally added disk instead of attaching

2009-12-08 Thread Kjetil Torgrim Homme
Daniel Carosone d...@geek.com.au writes:

 Not if you're trying to make a single disk pool redundant by adding
 ..  er, attaching .. a mirror; then there won't be such a warning,
 however effective that warning might or might not be otherwise.

 Not a problem because you can then detach the vdev and add it.

 It's a problem if you're trying to do that, but end up adding instead
 of attaching, which you can't (yet) undo.

at least in that case the amount of data shuffling you have to do is
limited to one disk (it's unlikely you'd make this mistake with a
multi-device vdev).
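
for reference, the two commands being confused, with made-up device
names:

  zpool attach tank c0t0d0 c0t1d0   # turns the single disk into a mirror
  zpool add tank c0t1d0             # adds a second top-level vdev (the mistake)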

in any case, the block rewrite implementation isn't *that* far away, is
it?
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] nodiratime support in ZFS?

2009-12-07 Thread Kjetil Torgrim Homme
I was catching up on old e-mail on this list, and came across a blog
posting from Henrik Johansson:

  http://sparcv9.blogspot.com/2009/10/curious-case-of-strange-arc.html

it tells of his woes with a fragmented /var/pkg/download combined
with atime updates.  I see the same problem on my servers, e.g.:

  $ time du -s /var/pkg/download
  1614308 /var/pkg/download
  real11m50.682s

  $ time du -s /var/pkg/download
  1614308 /var/pkg/download
  real12m03.395s

on this server, increasing arc_meta_limit wouldn't help, but I think
a newer kernel would be more aggressive (this is 2008.11).

  arc_meta_used  =   262 MB
  arc_meta_limit =  2812 MB
  arc_meta_max   =   335 MB

turning off atime helps:

  real 8m06.563s

in this test case, running du(1), turning off atime altogether isn't
really needed; it would suffice to turn off atime updates on
directories.  in Linux, this can be achieved with the mount option
nodiratime.  if ZFS had it, I guess it would be a new value for the
atime property, nodir or somesuch.
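
something like this, that is (the ZFS property value is purely
hypothetical, it doesn't exist today):

  # Linux: keep atime on files, skip it on directories
  mount -o remount,nodiratime /export/home
  # imagined ZFS equivalent:
  # zfs set atime=nodir tank/export/home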

I quite often find it useful to have access to atime information to see
if files have been read, for forensic purposes, for debugging, etc. so I
am loath to turn it off.  however, atime on directories can hardly ever
be used for anything -- you have to take really good care not to trigger
an update just checking the atime, and even if you do get a proper
reading, there are so many tree-traversing utilities that the
information value is low.  it is quite unlikely that any application
would break in a nodiratime mode, and few people should have any qualms
about enabling it.

Santa, are you listening? :-)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128

2009-11-25 Thread Kjetil Torgrim Homme
Daniel Carosone d...@geek.com.au writes:

 you can fetch the cr_txg (cr for creation) for a
 snapshot using zdb,

 yes, but this is hardly an appropriate interface.

agreed.

 zdb is also likely to cause disk activity because it looks at many
 things other than the specific item in question.

I'd expect meta-information like this to fit comfortably in RAM over
extended amounts of time.  haven't tried, though.

 but the very creation of a snapshot requires a new
 txg to note that fact in the pool.

 yes, which is exactly what we're trying to avoid, because it requires
 disk activity to write.

you missed my point: you can't compare the current txg to an old cr_txg
directly, since the current txg value will be at least 1 higher, even if
no changes have been made.
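
just to make the off-by-one concrete, a rough sketch with zdb (not a
stable interface, and it touches the disks, as you point out; pool and
snapshot names are made up):

  cur=$(zdb -u tank | awk '$1 == "txg" {print $3}')
  last=$(zdb tank/home@latest | sed -n 's/.*cr_txg \([0-9]*\),.*/\1/p')
  # the snapshot itself consumed one txg, hence the + 1
  [ "$cur" -eq $(( last + 1 )) ] && echo "pool idle since the last snapshot"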

 if the snapshot is taken recursively, all snapshots will have the
 same cr_txg, but that requires the same configuration for all
 filesets.

 again, yes, but that's irrelevant - the important knowledge at this
 moment is that the txg has not changed since last time, and that thus
 there will be no benefit in taking further snapshots, regardless of
 configuration.

yes, that's what we're trying to establish, and it's easier when
all snapshots are committed in the same txg.
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128

2009-11-24 Thread Kjetil Torgrim Homme
Daniel Carosone d...@geek.com.au writes:

 I don't think it is easy to do, the txg counter is on
 a pool level,
 [..]
 it would help when the entire pool is idle, though.

 .. which is exactly the scenario in question: when the disks are
 likely to be spun down already (or to spin down soon without further
 activity), and you want to avoid waking them up (or keeping them
 awake) with useless snapshot activity.

good point!

 However, this highlights that a (pool? fs?) property that exposes the
 current txg id (frozen in snapshots, as normal, if an fs property)
 might be enough for the userspace daemon to make its own decision to
 avoid requesting snapshots, without needing a whole discretionary
 mechanism in zfs itself.

you can fetch the cr_txg (cr for creation) for a snapshot using zdb,
but the very creation of a snapshot requires a new txg to note that fact
in the pool.  if there are several filesystems to snapshot, you'll get a
sequence of cr_txg, and they won't be adjacent.

  # zdb tank/te...@snap1
  Dataset tank/te...@snap1 [ZVOL], ID 78, cr_txg 872401, 4.03G, 3 objects
  # zdb -u tank
  txg = 872402
  timestamp = 1259064201 UTC = Tue Nov 24 13:03:21 2009
  # sync
  # zdb -u tank
  txg = 872402
  # zfs snapshot tank/te...@snap1
  # zdb tank/te...@snap1
  Dataset tank/te...@snap1 [ZVOL], ID 80, cr_txg 872419, 4.03G, 3 objects
  # zdb -u tank
  txg = 872420
  timestamp = 1259064641 UTC = Tue Nov 24 13:10:41 2009

if the snapshot is taken recursively, all snapshots will have the same
cr_txg, but that requires the same configuration for all filesets.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Basic question about striping and ZFS

2009-11-23 Thread Kjetil Torgrim Homme
Kjetil Torgrim Homme kjeti...@linpro.no writes:
 Cindy Swearingen cindy.swearin...@sun.com writes:
 You might check the slides on this page:

 http://hub.opensolaris.org/bin/view/Community+Group+zfs/docs

 Particularly, slides 14-18.

 In this case, graphic illustrations are probably the best way
 to answer your questions.

 thanks, Cindy.  can you explain the meaning of the blocks marked X in
 the illustration on page 18?

I found the explanation in an older (2009-09-03) message to this list
from Adam Leventhal:

|   RAID-Z writes full stripes every time; note that without careful
|   accounting it would be possible to effectively fragment the vdev
|   such that single sectors were free but useless since single-parity
|   RAID-Z requires two adjacent sectors to store data (one for data,
|   one for parity). To address this, RAID-Z rounds up its allocation to
|   the next (nparity + 1).  This ensures that all space is accounted
|   for. RAID-Z will thus skip sectors that are unused based on this
|   rounding. For example, under raidz1 a write of 1024 bytes would
|   result in 512 bytes of parity, 512 bytes of data on two devices and
|   512 bytes skipped.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128

2009-11-23 Thread Kjetil Torgrim Homme
Daniel Carosone d...@geek.com.au writes:

 Would there be a way to avoid taking snapshots if they're going to be
 zero-sized?

I don't think it is easy to do; the txg counter is on the pool level,
AFAIK:

  # zdb -u spool
  Uberblock

magic = 00bab10c
version = 13
txg = 1773324
guid_sum = 16611641539891595281
timestamp = 1258992244 UTC = Mon Nov 23 17:04:04 2009

it would help when the entire pool is idle, though.

 (posted here, rather than in response to the mailing list reference
 given, because I'm not subscribed [...]

ditto.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-raidz - simulate disk failure

2009-11-23 Thread Kjetil Torgrim Homme
sundeep dhall sundeep.dh...@sun.com writes:
 Q) How do I simulate a sudden 1-disk failure to validate that zfs /
 raidz handles things well without data errors

 Options considered
 1. suddenly pulling a disk out 
 2. using zpool offline

 I think both these have issues in simulating a sudden failure 

why not take a look at what HP's test department is doing and fire a
round through the disk with a rifle?  oh, I guess that won't be a
*simulation*.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick drive slicing madness question

2009-11-09 Thread Kjetil Torgrim Homme
Darren J Moffat darr...@opensolaris.org writes:

 Mauricio Tavares wrote:
 If I have a machine with two drives, could I create equal size slices
 on the two disks, set them up as boot pool (mirror) and then use the
 remaining space as a striped pool for other more wasteful
 applications?

 You could but why bother ?  Why not just create one mirrored pool.

you get half the space available...  even if you don't forego redundancy
and use mirroring on both slices, you can't extend the data pool later.
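
for reference, the layout being asked about would look something like
this (device names made up, s0 a small slice, s1 the rest of each disk):

  zpool create rpool mirror c0t0d0s0 c0t1d0s0   # mirrored root
  zpool create data c0t0d0s1 c0t1d0s1           # striped data pool, no redundancy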

 Having two pools on the same disk (or mirroring to the same disk) is
 asking for performance pain if both are being written to heavily.

not too common with heavy writing to rpool, is it?  the main source of
writing is syslog, I guess.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compression at zfs filesystem creation

2009-06-17 Thread Kjetil Torgrim Homme
David Magda dma...@ee.ryerson.ca writes:

 On Tue, June 16, 2009 15:32, Kyle McDonald wrote:

 So the cache saves not only the time to access the disk but also
 the CPU time to decompress. Given this, I think it could be a big
 win.

 Unless you're in GIMP working on JPEGs, or doing some kind of MPEG
 video editing--or ripping audio (MP3 / AAC / FLAC) stuff. All of
 which are probably some of the largest files in most people's
 homedirs nowadays.

indeed.  I think only programmers will see any substantial benefit
from compression, since both the code itself and the object files
generated are easily compressible.

 1 GB of e-mail is a lot (probably my entire personal mail collection
 for a decade) and will compress well; 1 GB of audio files is
 nothing, and won't compress at all.

 Perhaps compressing /usr could be handy, but why bother enabling
 compression if the majority (by volume) of user data won't do
 anything but burn CPU?

 So the correct answer on whether compression should be enabled by
 default is it depends. (IMHO :) )

I'd be interested to see benchmarks on MySQL/PostgreSQL performance
with compression enabled.  my *guess* would be it isn't beneficial
since they usually do small reads and writes, and there is little gain
in reading 4 KiB instead of 8 KiB.

what other use cases can benefit from compression?
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compression at zfs filesystem creation

2009-06-17 Thread Kjetil Torgrim Homme
Fajar A. Nugraha fa...@fajar.net writes:

 Kjetil Torgrim Homme wrote:
 indeed.  I think only programmers will see any substantial benefit
 from compression, since both the code itself and the object files
 generated are easily compressible.

 Perhaps compressing /usr could be handy, but why bother enabling
 compression if the majority (by volume) of user data won't do
 anything but burn CPU?

 How do you define substantial? My opensolaris snv_111b installation
 has 1.47x compression ratio for /, with the default compression.
 It's well worthed for me.

I don't really care if my / is 5 GB or 3 GB.  how much faster is
your system operating?  what's the compression ratio on your data
areas?
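
for reference, the per-dataset numbers are easy to check (pool name just
as an example):

  zfs get -r compressratio rpool
  zfs list -o name,used,compressratio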

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compression at zfs filesystem creation

2009-06-17 Thread Kjetil Torgrim Homme
Monish Shah mon...@indranetworks.com writes:

 I'd be interested to see benchmarks on MySQL/PostgreSQL performance
 with compression enabled.  my *guess* would be it isn't beneficial
 since they usually do small reads and writes, and there is little
 gain in reading 4 KiB instead of 8 KiB.

 OK, now you have switched from compressibility of data to
 performance advantage.  As I said above, this kind of data usually
 compresses pretty well.

the thread has been about I/O performance since the first response, as
far as I can tell.

 I agree that for random reads, there wouldn't be any gain from
 compression. For random writes, in a copy-on-write file system,
 there might be gains, because the blocks may be arranged in
 sequential fashion anyway.  We are in the process of doing some
 performance tests to prove or disprove this.

 Now, if you are using SSDs for this type of workload, I'm pretty
 sure that compression will help writes.  The reason is that the
 flash translation layer in the SSD has to re-arrange the data and
 write it page by page.  If there is less data to write, there will
 be fewer program operations.

 Given that write IOPS rating in an SSD is often much less than read
 IOPS, using compression to improve that will surely be of great
 value.

not necessarily, since a partial SSD write is much more expensive than
a full block write (128 KiB?).  in a write-intensive application, that
won't be an issue since the data is flowing steadily, but for the
right mix of random reads and writes, this may exacerbate the
bottleneck.

 At this point, this is educated guesswork.  I'm going to see if I
 can get my hands on an SSD to prove this.

that'd be great!

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] disabling showmount -e behaviour

2009-05-27 Thread Kjetil Torgrim Homme
Roman V Shaposhnik r...@sun.com writes:

 I must admit that this question originates in the context of Sun's
 Storage 7210 product, which impose additional restrictions on the
 kind of knobs I can turn.

 But here's the question: suppose I have an installation where ZFS is
 the storage for user home directories. Since I need quotas, each
 directory gets to be its own filesystem. Since I also need these
 homes to be accessible remotely each FS is exported via NFS. Here's
 the question though: how do I prevent showmount -e (or a manually
 constructed EXPORT/EXPORTALL RPC request) to disclose a list of
 users that are hosted on a particular server?

I think the best you can do is to reject mount protocol requests
coming from high ports (1024+) in your firewall.  this means you
need root privileges (or a more specific capability) on the client to
fetch the list.
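
something along these lines, assuming IP Filter and that you have looked
up (or pinned) the port mountd registered with rpcbind -- 33333 below is
just a placeholder:

  rpcinfo -p nfsserver | grep mountd
  # /etc/ipf/ipf.conf:
  # block in quick proto tcp from any port > 1023 to any port = 33333
  # block in quick proto udp from any port > 1023 to any port = 33333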

another option is to make the usernames opaque and anonymous, e.g.,
u4233.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Kjetil Torgrim Homme
Frank Middleton f.middle...@apogeect.com writes:

 Exactly. My whole point. And without ECC there's no way of knowing.
 But if the data is damaged /after/ checksum but /before/ write, then
 you have a real problem...

we can't do much to protect ourselves from damage to the data itself
(an extra copy in RAM will help little and ruin performance).

damage to the bits holding the computed checksum before it is written
can be alleviated by doing the calculation independently for each
written copy.  in particular, this will help if the bit error is
transient.

since the number of bits in RAM holding the checksum is dwarfed by the
number of bits occupied by the data (256 bits vs. one mebibit for a
full default-sized record), such a paranoia mode will most likely tell
you that the *data* is corrupt, not the checksum.  but today you don't
know, so it's an improvement in my book.

 Quoting the ZFS admin guide: The failmode property ... provides the
 failmode property for determining the behavior of a catastrophic
 pool failure due to a loss of device connectivity or the failure of
 all devices in the pool. . Has this changed since the ZFS admin
 guide was last updated?  If not, it doesn't seem relevant.

I guess checksum error handling is orthogonal to this and should have
its own property.  it sure would be nice if the admin could ask the OS
to deliver the bits contained in a file, no matter what, and just log
the problem.

 Cheers -- Frank

thank you for pointing out this potential weakness in ZFS' consistency
checking; I didn't realise it was there.

also thank you, all ZFS developers, for your great job :-)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss