Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-02-03 Thread Pasi Kärkkäinen
On Sun, Jan 20, 2013 at 07:51:15PM -0800, Richard Elling wrote:
 
  2. VAAI support.
 
VAAI has 4 features, 3 of which have been in illumos for a long time. The
remaining
feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor
product,
but the CEO made a conscious (and unpopular) decision to keep that code
from the
community. Over the summer, another developer picked up the work in the
community,
but I've lost track of the progress and haven't seen an RTI yet.
 

I assume SCSI UNMAP is implemented in Comstar in NexentaStor? 
Isn't Comstar CDDL licensed? 

There's also this:
https://www.illumos.org/issues/701

.. which says UNMAP support was added to Illumos Comstar 2 years ago.


-- Pasi

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-02-01 Thread Joerg Schilling
casper@oracle.com wrote:

 It gets even better.  Executables become part of the swap space via
 mmap, so that if you have a lot of copies of the same process running in
 memory, the executable bits don't waste any more space (well, unless you
 use the sticky bit, although that might be deprecated, or if you copy
 the binary elsewhere.)  There's lots of awesome fun optimizations in
 UNIX. :)

 The sticky bit has never been used in that form in SunOS for as long
 as I can remember (SunOS 3.x), and probably before that.  It no longer makes
 sense for demand-paged executables.

SunOS-3.0 introduced NFS root and swap on NFS. For that reason, the meaning of 
the sticky bit was changed to "do not cache-write this file".

Note that SunOS-3.0 appeared with the new Sun3 machines (first build on 
24 December 1985).

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-29 Thread Robert Milkowski
  It also has a lot of performance improvements and general bug fixes in
  the Solaris 11.1 release.
 
 Performance improvements such as?


Dedup'ed ARC for one.
Zero blocks automatically dedup'ed in memory.
Improvements to ZIL performance.
Zero-copy zfs+nfs+iscsi
...
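The zero-block optimization above can be illustrated with a small sketch. This is not Oracle's actual ARC implementation, just the idea: all-zero blocks are detected on insertion and aliased to one shared buffer, so any number of zero blocks costs the memory of one.

```python
# Sketch of in-memory dedup of zero blocks (illustrative only;
# not the actual Solaris ARC code).

BLOCK_SIZE = 128 * 1024  # a typical ZFS recordsize (assumption)

# One shared, immutable buffer stands in for every all-zero block.
_SHARED_ZERO = bytes(BLOCK_SIZE)

def cache_block(cache, key, data):
    """Insert a block into an in-memory cache, collapsing zero blocks."""
    if data.count(0) == len(data):   # fast all-zero test
        cache[key] = _SHARED_ZERO    # alias the shared buffer, no copy
    else:
        cache[key] = data

cache = {}
cache_block(cache, "blk-a", bytes(BLOCK_SIZE))     # all zeros
cache_block(cache, "blk-b", bytes(BLOCK_SIZE))     # all zeros
cache_block(cache, "blk-c", b"\x01" * BLOCK_SIZE)  # real data

# Both zero blocks reference the single shared buffer.
assert cache["blk-a"] is cache["blk-b"] is _SHARED_ZERO
assert cache["blk-c"] is not _SHARED_ZERO
```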


-- 
Robert Milkowski
http://milek.blogspot.com




Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-29 Thread Sašo Kiselkov
On 01/29/2013 02:59 PM, Robert Milkowski wrote:
 It also has a lot of performance improvements and general bug fixes in
 the Solaris 11.1 release.

 Performance improvements such as?
 
 
 Dedup'ed ARC for one.
 Zero blocks automatically dedup'ed in memory.
 Improvements to ZIL performance.
 Zero-copy zfs+nfs+iscsi
 ...

Cool, thanks for the inspiration on my next work in Illumos' ZFS.

Cheers,
--
Saso


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-29 Thread Robert Milkowski


From: Richard Elling
Sent: 21 January 2013 03:51

VAAI has 4 features, 3 of which have been in illumos for a long time. The
remaining
feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor
product, 
but the CEO made a conscious (and unpopular) decision to keep that code
from the 
community. Over the summer, another developer picked up the work in the
community, 
but I've lost track of the progress and haven't seen an RTI yet.

That is one thing that has always bothered me... so it's OK for others, like
Nexenta, to keep stuff closed and not in the open, but if Oracle does it they
are bad?

Isn't that at least a little hypocritical? (bashing Oracle while doing much
the same)

-- 
Robert Milkowski
http://milek.blogspot.com




Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-29 Thread Sašo Kiselkov
On 01/29/2013 03:08 PM, Robert Milkowski wrote:
 From: Richard Elling
 Sent: 21 January 2013 03:51
 VAAI has 4 features, 3 of which have been in illumos for a long time. The
 remaining
 feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor
 product, 
 but the CEO made a conscious (and unpopular) decision to keep that code
 from the 
 community. Over the summer, another developer picked up the work in the
 community, 
 but I've lost track of the progress and haven't seen an RTI yet.
 
 That is one thing that always bothered me... so it is ok for others, like
 Nexenta, to keep stuff closed and not in open, while if Oracle does it they
 are bad?
 
 Isn't it at least a little bit being hypocritical? (bashing Oracle and doing
 sort of the same)

Nexenta is a downstream repository that chooses to keep some of their
new developments in-house while making others open. Most importantly,
they participate and make a conscious effort to play nice.

Contrast this with Oracle. Oracle swoops in and buys up Sun, closes
*all* of the technologies it can turn a profit on, changes licensing
terms to be extremely draconian, and in the process takes a dump on the
whole open-source community and large numbers of its customers.

Now imagine which of these two is more popular in the community.

(Disclaimer: my company was formerly an almost exclusive Sun shop.)

Cheers,
--
Saso


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-29 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Robert Milkowski [mailto:rmilkow...@task.gda.pl]
 
 That is one thing that always bothered me... so it is ok for others, like
 Nexenta, to keep stuff closed and not in open, while if Oracle does it they
 are bad?

Oracle, like Nexenta, and my own company CleverTrove, and Microsoft, and 
NetApp, has every right to close-source development, if they believe it's 
beneficial to their business.  For all we know, Oracle might not even have a 
choice about it - it might have been in the terms of the settlement with NetApp 
(because open-source ZFS definitely hurt NetApp's business.)  The real question 
is: in which situations is it beneficial to your business to be closed source, 
as opposed to open source?  There's the whole Red Hat/CentOS dichotomy.  At 
first blush, it would seem Red Hat gets screwed by CentOS (or Oracle Linux), but 
then you realize how many more Red Hat-derived systems are out there, compared 
to SUSE, etc.  By allowing people to use it for free, it actually gains 
popularity, and then Red Hat actually has a successful support business model, as 
compared to SUSE, which tanked.

But it's useless to argue about whether Oracle is making the right business 
choice, or whether open or closed source is better for their business.  It's 
their choice, regardless of who agrees.  Arguing about it here isn't going to do 
any good.

Those of us who gained something and no longer count on having that benefit 
moving forward have a tendency to say "You gave it to me for free before; now 
I'm pissed off because you're not giving it to me for free anymore" instead 
of "thanks for what you gave before".

The world moves on.  There's plenty of time to figure out which solution is 
best for you, the consumer, among the future product offerings: a commercial 
closed-source product, an open-source product, or something completely 
different such as btrfs.



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-29 Thread Richard Elling
On Jan 29, 2013, at 6:08 AM, Robert Milkowski rmilkow...@task.gda.pl wrote:

 From: Richard Elling
 Sent: 21 January 2013 03:51
 
 VAAI has 4 features, 3 of which have been in illumos for a long time. The
 remaining
 feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor
 product, 
 but the CEO made a conscious (and unpopular) decision to keep that code
 from the 
 community. Over the summer, another developer picked up the work in the
 community, 
 but I've lost track of the progress and haven't seen an RTI yet.
 
 That is one thing that always bothered me... so it is ok for others, like
 Nexenta, to keep stuff closed and not in open, while if Oracle does it they
 are bad?

Nexenta is just as bad. For the record, the illumos-community folks who worked
at Nexenta at the time were overruled by executive management. Some of those
folks are now executive management elsewhere :-)

 
 Isn't it at least a little bit being hypocritical? (bashing Oracle and doing
 sort of the same)

No, not at all.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-24 Thread Darren J Moffat



On 01/24/13 00:04, Matthew Ahrens wrote:

On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
darr...@opensolaris.org mailto:darr...@opensolaris.org wrote:

Preallocated ZVOLs - for swap/dump.


Darren, good to hear about the cool stuff in S11.

Just to clarify, is this preallocated ZVOL different than the
preallocated dump which has been there for quite some time (and is in
Illumos)?  Can you use it for other zvols besides swap and dump?


It is the same but we are using it for swap now too.  It isn't available 
for general use.



Some background:  the zfs dump device has always been preallocated
(thick provisioned), so that we can reliably dump.  By definition,
something has gone horribly wrong when we are dumping, so this code path
needs to be as small as possible to have any hope of getting a dump.  So
we preallocate the space for dump, and store a simple linked list of
disk segments where it will be stored.  The dump device is not COW,
checksummed, deduped, compressed, etc. by ZFS.


For the sake of others (I know you know this, Matt): the dump subsystem does 
the compression itself, so ZFS doesn't need to.



In Illumos (and S10), swap was treated more or less like a regular zvol.
  This leads to some tricky code paths because ZFS allocates memory from
many points in the code as it is writing out changes.  I could see
advantages to the simplicity of a preallocated swap volume, using the
same code that already existed for preallocated dump.  Of course, the
loss of checksumming and encryption is much more of a concern with swap
(which is critical for correct behavior) than with dump (which is nice
to have for debugging).


We have encryption for dump because it is hooked in to the zvol code.

For encrypting swap, Illumos could do the same as Solaris 11 does and use 
lofi.  I changed swapadd so that if encryption is specified in the 
options field of the vfstab entry, it creates a lofi shim over the swap 
device using 'lofiadm -e'.  This gives you encrypted swap regardless 
of what the underlying disk is (normal ZVOL, prealloc ZVOL, real disk 
slice, SVM mirror, etc).
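For reference, a rough sketch of what that looks like in practice. The device paths are made-up examples, and the commands are only approximately what swapadd does behind the scenes (see lofiadm(1M) and vfstab(4) for the authoritative details):

```
# Hypothetical vfstab entry requesting encrypted swap in the options field:
#   /dev/zvol/dsk/rpool/swap  -  -  swap  -  no  encrypted

# What swapadd then does, roughly: attach an encrypted lofi shim over
# the swap device and use the resulting lofi device as swap.
lofiadm -e -a /dev/zvol/dsk/rpool/swap   # encrypted lofi shim (prints /dev/lofi/N)
swap -a /dev/lofi/1                      # lofi device number is illustrative
```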


--
Darren J Moffat


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-24 Thread Jim Klimov

On 2013-01-24 11:06, Darren J Moffat wrote:


On 01/24/13 00:04, Matthew Ahrens wrote:

On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
darr...@opensolaris.org mailto:darr...@opensolaris.org wrote:

Preallocated ZVOLs - for swap/dump.


Darren, good to hear about the cool stuff in S11.


Yes, thanks, Darren :)


Just to clarify, is this preallocated ZVOL different than the
preallocated dump which has been there for quite some time (and is in
Illumos)?  Can you use it for other zvols besides swap and dump?


It is the same but we are using it for swap now too.  It isn't available
for general use.


Some background:  the zfs dump device has always been preallocated
(thick provisioned), so that we can reliably dump.  By definition,
something has gone horribly wrong when we are dumping, so this code path
needs to be as small as possible to have any hope of getting a dump.  So
we preallocate the space for dump, and store a simple linked list of
disk segments where it will be stored.  The dump device is not COW,
checksummed, deduped, compressed, etc. by ZFS.


Comparing these two statements, can I say (and be correct) that the
preallocated swap devices would lack COW (as I proposed too), and thus
likely snapshots, but would also lack checksums? (We might live
without compression, though that was once touted as a bonus for swap
over ZFS, and we can certainly do without dedup.)

Basically, they are seemingly little different from preallocated
disk slices - and for those an admin might have better control over
the dedicated disk locations (i.e. faster tracks in a small-seek
stroke range), except that ZFS datasets are easier to resize...
right or wrong?

//Jim



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-23 Thread Casper . Dik

IIRC dump is special.

As for swap... really, you don't want to swap.  If you're swapping you
have problems.  Any swap space you have is to help you detect those
problems and correct them before apps start getting ENOMEM.  There
*are* exceptions to this, such as Varnish.  For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no
nothing.

Yes and no: the system reserves a lot of additional memory (Solaris 
doesn't over-commit swap) and swap is needed to support those 
reservations.  Also, some pages are dirtied early on and never touched 
again; those pages should not be kept in memory.

But continuously swapping is clearly a sign of a system too small for its 
job.

Of course, compressing and/or encrypting swap has interesting issues: in 
order to free memory by swapping pages out requires even more memory.

Casper



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-23 Thread Jim Klimov

On 2013-01-23 09:41, casper@oracle.com wrote:

Yes and no: the system reserves a lot of additional memory (Solaris
doesn't over-commit swap) and swap is needed to support those
reservations.  Also, some pages are dirtied early on and never touched
again; those pages should not be kept in memory.



I believe, by the symptoms, that this is what often happens
in particular to Java processes (app servers and such) - I do
regularly see these have large VM sizes and much (3x) smaller
RSS sizes. One explanation I've seen is that the JVM nominally
depends on a number of shared libraries which are loaded to
fulfill the runtime requirements, but aren't actively used and
thus go out to swap quickly. I chose to trust that statement ;)

//Jim


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-23 Thread Ian Collins

Jim Klimov wrote:

On 2013-01-23 09:41, casper@oracle.com wrote:

Yes and no: the system reserves a lot of additional memory (Solaris
doesn't over-commit swap) and swap is needed to support those
reservations.  Also, some pages are dirtied early on and never touched
again; those pages should not be kept in memory.


I believe, by the symptoms, that this is what happens often
in particular to Java processes (app-servers and such) - I do
regularly see these have large VM sizes and much (3x) smaller
RSS sizes.


Being swapped out is probably the best thing that can be done to most 
Java processes :)


--
Ian.



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-23 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Gary Mills [mailto:gary_mi...@fastmail.fm]
 
  In solaris, I've never seen it swap out idle processes; I've only
  seen it use swap for the bad bad bad situation.  I assume that's all
  it can do with swap.
 
 You would be wrong.  Solaris uses swap space for paging.  Paging out
 unused portions of an executing process from real memory to the swap
 device is certainly beneficial.  Swapping out complete processes is a
 desperation move, but paging out most of an idle process is a good
 thing.

You seem to be emphasizing the distinction between swapping and paging.  My 
point, though, is that I've never seen swap usage (which is used for paging) on 
any Solaris derivative go nonzero for the sake of keeping something in cache.  
It seems to me that Solaris will always evict all cache memory before it swaps 
(pages) out even the most idle process memory.



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-23 Thread Ray Arachelian
On 01/22/2013 10:50 PM, Gary Mills wrote:
 On Tue, Jan 22, 2013 at 11:54:53PM +, Edward Ned Harvey 
 (opensolarisisdeadlongliveopensolaris) wrote:
 Paging out unused portions of an executing process from real memory to
 the swap device is certainly beneficial. Swapping out complete
 processes is a desperation move, but paging out most of an idle
 process is a good thing. 

It gets even better.  Executables become part of the swap space via
mmap, so that if you have a lot of copies of the same process running in
memory, the executable bits don't waste any more space (well, unless you
use the sticky bit, although that might be deprecated, or if you copy
the binary elsewhere.)  There's lots of awesome fun optimizations in
UNIX. :)


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-23 Thread Casper . Dik

On 01/22/2013 10:50 PM, Gary Mills wrote:
 On Tue, Jan 22, 2013 at 11:54:53PM +, Edward Ned Harvey 
 (opensolarisisdeadlongliveopensolaris) wrote:
 Paging out unused portions of an executing process from real memory to
 the swap device is certainly beneficial. Swapping out complete
 processes is a desperation move, but paging out most of an idle
 process is a good thing. 

It gets even better.  Executables become part of the swap space via
mmap, so that if you have a lot of copies of the same process running in
memory, the executable bits don't waste any more space (well, unless you
use the sticky bit, although that might be deprecated, or if you copy
the binary elsewhere.)  There's lots of awesome fun optimizations in
UNIX. :)

The sticky bit has never been used in that form in SunOS for as long
as I can remember (SunOS 3.x), and probably before that.  It no longer makes 
sense for demand-paged executables.

Casper



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-23 Thread Matthew Ahrens
On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat darr...@opensolaris.orgwrote:

 Preallocated ZVOLs - for swap/dump.


Darren, good to hear about the cool stuff in S11.

Just to clarify, is this preallocated ZVOL different than the preallocated
dump which has been there for quite some time (and is in Illumos)?  Can you
use it for other zvols besides swap and dump?

Some background:  the zfs dump device has always been preallocated (thick
provisioned), so that we can reliably dump.  By definition, something has
gone horribly wrong when we are dumping, so this code path needs to be as
small as possible to have any hope of getting a dump.  So we preallocate
the space for dump, and store a simple linked list of disk segments where
it will be stored.  The dump device is not COW, checksummed, deduped,
compressed, etc. by ZFS.

In Illumos (and S10), swap was treated more or less like a regular zvol.
 This leads to some tricky code paths because ZFS allocates memory from
many points in the code as it is writing out changes.  I could see
advantages to the simplicity of a preallocated swap volume, using the same
code that already existed for preallocated dump.  Of course, the loss of
checksumming and encryption is much more of a concern with swap (which is
critical for correct behavior) than with dump (which is nice to have for
debugging).
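The preallocated dump design described above can be sketched as a toy model (purely illustrative; the class and function names are invented here, and this is not the on-disk format): the device is a linked list of segments reserved up front, and a dump simply streams into them in order, with no COW, checksumming, or allocation at write time.

```python
# Toy model of a preallocated dump device: a linked list of disk
# segments reserved in advance, written sequentially at dump time.
# Illustrative only; not the actual ZFS representation.

class Segment:
    def __init__(self, offset, length, nxt=None):
        self.offset = offset   # starting byte on disk
        self.length = length   # bytes reserved
        self.next = nxt        # next segment in the list

def plan_dump_writes(head, dump_size):
    """Return (offset, nbytes) writes covering dump_size bytes by
    walking the preallocated segment list. No allocation happens here,
    which is the point: the dump code path stays tiny."""
    writes = []
    seg, remaining = head, dump_size
    while seg and remaining > 0:
        n = min(seg.length, remaining)
        writes.append((seg.offset, n))
        remaining -= n
        seg = seg.next
    if remaining > 0:
        raise RuntimeError("dump device too small")
    return writes

# Two preallocated segments: 1 MiB at offset 0, 512 KiB at offset 4 MiB.
segs = Segment(0, 1 << 20, Segment(4 << 20, 512 << 10))
print(plan_dump_writes(segs, (1 << 20) + 1024))
# -> [(0, 1048576), (4194304, 1024)]
```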

--matt


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Darren J Moffat

On 01/21/13 17:03, Sašo Kiselkov wrote:

Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.


Just a few examples:

Solaris ZFS already has support for 1MB block size.

Support for SCSI UNMAP - both issuing it and honoring it when it is the 
backing store of an iSCSI target.


It also has a lot of performance improvements and general bug fixes in 
the Solaris 11.1 release.


--
Darren J Moffat


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Tomas Forsman
On 22 January, 2013 - Darren J Moffat sent me these 0,6K bytes:

 On 01/21/13 17:03, Sašo Kiselkov wrote:
 Again, what significant features did they add besides encryption? I'm
 not saying they didn't, I'm just not aware of that many.

 Just a few examples:

 Solaris ZFS already has support for 1MB block size.

 Support for SCSI UNMAP - both issuing it and honoring it when it is the  
 backing store of an iSCSI target.

Would this apply to say a SATA SSD used as ZIL? (which we have, a
vertex2ex with supercap)

/Tomas
-- 
Tomas Forsman, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Sašo Kiselkov
On 01/22/2013 12:30 PM, Darren J Moffat wrote:
 On 01/21/13 17:03, Sašo Kiselkov wrote:
 Again, what significant features did they add besides encryption? I'm
 not saying they didn't, I'm just not aware of that many.
 
 Just a few examples:
 
 Solaris ZFS already has support for 1MB block size.

Working on that as we speak.
I'll see your 1MB and raise you another 7 :P

 Support for SCSI UNMAP - both issuing it and honoring it when it is the
 backing store of an iSCSI target.

AFAIK, the first isn't in Illumos' ZFS, while the latter one is (though
I might be mistaken). In any case, interesting features.

 It also has a lot of performance improvements and general bug fixes in
 the Solaris 11.1 release.

Performance improvements such as?

Cheers,
--
Saso


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Darren J Moffat



On 01/22/13 11:57, Tomas Forsman wrote:

On 22 January, 2013 - Darren J Moffat sent me these 0,6K bytes:


On 01/21/13 17:03, Sašo Kiselkov wrote:

Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.


Just a few examples:

Solaris ZFS already has support for 1MB block size.

Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.


Would this apply to say a SATA SSD used as ZIL? (which we have, a
vertex2ex with supercap)


If the device advertises the UNMAP feature and you are running Solaris 
11.1 it should attempt to use it.


--
Darren J Moffat


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Michel Jansens


Maybe 'shadow migration' ?  (eg: zfs create -o shadow=nfs://server/dir  
pool/newfs)


Michel


On 01/21/13 17:03, Sašo Kiselkov wrote:

Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.


Just a few examples:

Solaris ZFS already has support for 1MB block size.

Support for SCSI UNMAP - both issuing it and honoring it when it is  
the backing store of an iSCSI target.


It also has a lot of performance improvements and general bug fixes  
in the Solaris 11.1 release.


--
Darren J Moffat


Michel Jansens
mjans...@ulb.ac.be





Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Darren J Moffat



On 01/22/13 13:20, Michel Jansens wrote:


Maybe 'shadow migration' ? (eg: zfs create -o shadow=nfs://server/dir
pool/newfs)


That isn't really a ZFS feature, since it happens at the VFS layer.  The 
ZFS support there is really about getting the options passed through and 
reporting status; the core of the work happens at the VFS layer.


Shadow migration works with UFS as well!

Since I'm replying here are a few others that have been introduced in 
Solaris 11 or 11.1.


There is also the new improved ZFS share syntax for NFS and CIFS in 
Solaris 11.1 where you can much more easily inherit and also override 
individual share properties.


There are improved diagnostics rules.

ZFS support for Immutable Zones (mostly a VFS feature), Extended 
(privilege) Policy, and aliasing of datasets in Zones (so you don't see 
the part of the dataset hierarchy above the part delegated to the zone).


UEFI GPT label support for root pools with GRUB2 and on SPARC with OBP.

New per-file 'sensitive' flag.

Various ZIL and ARC performance improvements.

Preallocated ZVOLs - for swap/dump.


Michel


On 01/21/13 17:03, Sašo Kiselkov wrote:

Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.


Just a few examples:

Solaris ZFS already has support for 1MB block size.

Support for SCSI UNMAP - both issuing it and honoring it when it is
the backing store of an iSCSI target.

It also has a lot of performance improvements and general bug fixes in
the Solaris 11.1 release.

--
Darren J Moffat


Michel Jansens
mjans...@ulb.ac.be





--
Darren J Moffat


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Sašo Kiselkov
On 01/22/2013 02:20 PM, Michel Jansens wrote:
 
 Maybe 'shadow migration' ?  (eg: zfs create -o shadow=nfs://server/dir
 pool/newfs)

Hm, interesting, so it works as a sort of replication system, except
that the data needs to be read-only and you can start accessing it on
the target before the initial sync. Did I get that right?

--
Saso


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Darren J Moffat



On 01/22/13 13:29, Sašo Kiselkov wrote:

On 01/22/2013 02:20 PM, Michel Jansens wrote:


Maybe 'shadow migration' ?  (eg: zfs create -o shadow=nfs://server/dir
pool/newfs)


Hm, interesting, so it works as a sort of replication system, except
that the data needs to be read-only and you can start accessing it on
the target before the initial sync. Did I get that right?


The source filesystem needs to be read-only.  It works at the VFS layer 
so it doesn't copy snapshots or clones over.  Once mounted it appears 
like all the original data is instantly there.


There is an (optional) shadowd that pushes the migration along, but it 
will complete on its own anyway.


shadowstat(1M) gives information on the status of the migrations.
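Putting the pieces above together, the workflow looks roughly like this. The server and dataset names are placeholders; only the shadow property syntax and shadowstat(1M) come from the discussion above:

```
# Create a new filesystem shadowing a read-only NFS source; the data
# appears immediately and migrates in the background.
zfs create -o shadow=nfs://server/dir pool/newfs

# Clients can use pool/newfs right away; watch migration progress with:
shadowstat
```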


--
Darren J Moffat


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Darren J Moffat


On 01/22/13 13:29, Darren J Moffat wrote:

Since I'm replying here are a few others that have been introduced in
Solaris 11 or 11.1.


and another one I can't believe I missed, since I was one of the people 
that helped design it and I did the code review...


Per-file sensitivity labels for TX (Trusted Extensions) configurations.

and I'm sure I'm still missing stuff that is in Solaris 11 and 11.1.

--
Darren J Moffat


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Sašo Kiselkov
On 01/22/2013 02:39 PM, Darren J Moffat wrote:
 
 On 01/22/13 13:29, Darren J Moffat wrote:
 Since I'm replying here are a few others that have been introduced in
 Solaris 11 or 11.1.
 
 and another one I can't believe I missed since I was one of the people
 that helped design it and I did codereview...
 
 Per-file sensitivity labels for TX configurations.

Can you give some details on that? Google searches are turning up pretty dry.

Cheers,
--
Saso


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Casper . Dik

On 01/22/2013 02:39 PM, Darren J Moffat wrote:
 
 On 01/22/13 13:29, Darren J Moffat wrote:
 Since I'm replying here are a few others that have been introduced in
 Solaris 11 or 11.1.
 
 and another one I can't believe I missed since I was one of the people
 that helped design it and I did codereview...
 
 Per-file sensitivity labels for TX configurations.

Can you give some details on that? Google searches are turning up pretty dry.


Start here:

http://docs.oracle.com/cd/E26502_01/html/E29017/managefiles-1.html#scrolltoc


Look for "multilevel datasets".

Casper



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Bob Friesenhahn

On Mon, 21 Jan 2013, Jim Klimov wrote:


Yes, maybe there were more cool new things per year popping up
with Sun's concentrated engineering talent and financing, but now
it seems that most players - wherever they work now - took a pause
from the marathon, to refine what was done in the decade before.
And this is just as important as churning out innovations faster
than people can comprehend or audit or use them.


I am on most of the mailing lists where zfs is discussed and it is 
clear that significant issues/bugs are continually being discovered 
and fixed.  Fixes come from both the Illumos community and from 
outside it (e.g. from FreeBSD).


Zfs is already quite feature rich.  Many of us would lobby for 
bug fixes and performance improvements over features.


Sašo Kiselkov's LZ4 compression additions may qualify as features 
yet they also offer rather profound performance improvements.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Darren J Moffat [mailto:darr...@opensolaris.org]
 
 Support for SCSI UNMAP - both issuing it and honoring it when it is the
 backing store of an iSCSI target.

When I search for scsi unmap, I come up with all sorts of documentation that 
... is ... like reading a medical journal when all you want to know is the 
conversion from 98.6F to C.

Would you mind momentarily, describing what SCSI UNMAP is used for?  If I were 
describing to a customer (CEO, CFO) I'm not going to tell them about SCSI 
UNMAP, I'm going to say the new system has a new feature that enables ... or 
solves the ___ problem...  

Customer doesn't *necessarily* have to be as clueless as CEO/CFO.  Perhaps just 
another IT person, or whatever.



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Sašo Kiselkov
On 01/22/2013 04:32 PM, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) wrote:
 From: Darren J Moffat [mailto:darr...@opensolaris.org]

 Support for SCSI UNMAP - both issuing it and honoring it when it is the
 backing store of an iSCSI target.
 
 When I search for scsi unmap, I come up with all sorts of documentation that 
 ... is ... like reading a medical journal when all you want to know is the 
 conversion from 98.6F to C.
 
 Would you mind momentarily, describing what SCSI UNMAP is used for?  If I 
 were describing to a customer (CEO, CFO) I'm not going to tell them about 
 SCSI UNMAP, I'm going to say the new system has a new feature that enables 
 ... or solves the ___ problem...  
 
 Customer doesn't *necessarily* have to be as clueless as CEO/CFO.  Perhaps 
 just another IT person, or whatever.

SCSI UNMAP is a feature of the SCSI protocol that lets the host signal to a
storage device (such as an SSD) that a given data block is no longer in use
by the filesystem and may be erased.

TLDR:
It makes writing to flash faster. Flash write latency degrades with
time; UNMAP prevents that degradation. Keep in mind that this is only
important for sync-write workloads (e.g. Databases, NFS, etc.), not
async-write workloads (file servers, bulk storage). For ZFS this is a
win if you're using a flash-based slog (ZIL) device. You can entirely
side-step this issue (and performance-sensitive applications often do)
by placing the slog onto a device not based on flash, e.g. DDRDrive x1,
ZeusRAM, etc.

THE DETAILS:
As you may know, flash memory cells, by design, cannot be overwritten.
They can only be read (very fast), written when they are empty (called
programming, still quite fast) or erased (slow as hell). To implement
overwriting, when a flash controller detects an attempt to overwrite an
already programmed flash cell, it instead holds the write while it
erases the block first (which takes a lot of time), and only then
programs it with the new data.

Before SCSI Unmap (also called TRIM in SATA) filesystems had no way of
talking to the underlying flash memory to tell it that a given block of
data has been freed (e.g. due to a user deleting a file). So sooner or
later, a filesystem used up all empty blocks on the flash device and
essentially every write had to first erase some flash blocks to
complete. This impacts synchronous I/O write latency (e.g. ZIL, sync
database I/O, etc.).

With Unmap, a filesystem can preemptively tell the flash controller that
a given data block is no longer needed and the flash controller can, at
its leisure, pre-erase it. Thus, as long as you have free space on your
filesystem, most, if not all of your writes will be direct program
writes, not erase-program.
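The latency argument above can be sketched numerically. This is a toy model with hypothetical latencies (the `PROGRAM_US`/`ERASE_US` figures are made up for illustration), not measurements of any real device:

```python
# Toy model of why UNMAP/TRIM helps flash write latency: writes to
# pre-erased blocks pay only the fast "program" step; once the free pool
# runs dry, every overwrite pays an erase first. Latencies are hypothetical.

PROGRAM_US = 200    # assumed program (write) latency, microseconds
ERASE_US = 2000     # assumed block-erase latency, microseconds

def write_latency(writes, pre_erased):
    """Total latency of `writes` block writes given `pre_erased` free blocks."""
    total = 0
    for _ in range(writes):
        if pre_erased > 0:
            pre_erased -= 1                  # fast path: program an empty block
            total += PROGRAM_US
        else:
            total += ERASE_US + PROGRAM_US   # slow path: erase, then program
    return total

# With UNMAP the filesystem keeps the free pool replenished, so all 100
# writes hit the fast path; without it, the pool runs dry after 10 writes.
with_unmap = write_latency(100, pre_erased=100)
without_unmap = write_latency(100, pre_erased=10)
print(with_unmap, without_unmap)  # 20000 200000
```

Even in this crude model the erase-starved case is an order of magnitude slower, which is the effect sync-write workloads feel.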

Cheers,
--
Saso


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Andrew Gabriel

Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

From: Darren J Moffat [mailto:darr...@opensolaris.org]

Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.



When I search for scsi unmap, I come up with all sorts of documentation that 
... is ... like reading a medical journal when all you want to know is the 
conversion from 98.6F to C.

Would you mind momentarily, describing what SCSI UNMAP is used for?  If I were describing to a customer (CEO, CFO) I'm not going to tell them about SCSI UNMAP, I'm going to say the new system has a new feature that enables ... or solves the ___ problem...  


Customer doesn't *necessarily* have to be as clueless as CEO/CFO.  Perhaps just 
another IT person, or whatever.
  


SCSI UNMAP (or SATA TRIM) is a means of telling a storage device that 
some blocks are no longer needed. (This might be because a file has been 
deleted in the filesystem on the device.)


In the case of a Flash device, it can optimise usage by knowing this:
e.g. it can perhaps perform a background erase on the real blocks so
they're ready for reuse sooner, and/or improve its lifetime through
better wear leveling, since it has more spare space to play with. It can
also help by avoiding some read-modify-write operations, if the device
knows the data in the rest of the 4k block is no longer needed.


In the case of an iSCSI LUN target, these blocks no longer need to be 
archived, and if sparse space allocation is in use, the space they 
occupied can be freed off. In the particular case of ZFS provisioning 
the iSCSI LUN (COMSTAR), you might get performance improvements by 
having more free space to play with during other write operations to 
allow better storage layout optimisation.


So, bottom line is longer life of SSDs (maybe higher performance too if 
there's less waiting for erases during writes), and better space 
utilisation and performance for a ZFS COMSTAR target.


--
Andrew Gabriel


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Darren J Moffat



On 01/22/13 15:32, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

From: Darren J Moffat [mailto:darr...@opensolaris.org]

Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.


When I search for scsi unmap, I come up with all sorts of documentation that 
... is ... like reading a medical journal when all you want to know is the 
conversion from 98.6F to C.

Would you mind momentarily, describing what SCSI UNMAP is used for?  If I were 
describing to a customer (CEO, CFO) I'm not going to tell them about SCSI 
UNMAP, I'm going to say the new system has a new feature that enables ... or 
solves the ___ problem...

Customer doesn't *necessarily* have to be as clueless as CEO/CFO.  Perhaps just 
another IT person, or whatever.


It is a mechanism for part of the storage system above the disk (eg 
ZFS) to inform the disk that it is no longer using a given set of blocks.


This is useful when using an SSD - see Saso's excellent response on that.

However it can also be very useful when your disk is an iSCSI LUN.  It
allows the filesystem layer (eg ZFS or NTFS, etc), when on an iSCSI LUN that
advertises SCSI UNMAP, to tell the target there are blocks in that LUN it
isn't using any more (eg it just freed some blocks).


This means you can get more accurate space usage when using things like 
iSCSI.


ZFS in Solaris 11.1 issues SCSI UNMAP to devices that support it, and
ZVOLs exported over COMSTAR advertise it too.


In the iSCSI case it is mostly about improved space accounting and 
utilisation.  This is particularly interesting with ZFS when snapshots 
and clones of ZVOLs come into play.


Some vendors call this (and things like it) Thin Provisioning; I'd say 
it is more accurately communication between 'disk' and filesystem about 
in-use blocks.


--
Darren J Moffat


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Casper . Dik


Some vendors call this (and thins like it) Thin Provisioning, I'd say 
it is more accurate communication between 'disk' and filesystem about 
in use blocks.

In some cases, users of disks are charged by bytes in use; when not using
SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
the whole reservation; this becomes costly when your standard usage is 
much less than your peak usage.

Thin provisioning can now be used for zpools as long as the underlying 
LUNs have support for SCSI UNMAP.
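Casper's billing point can be made concrete with a toy accounting model (the numbers and the event format are hypothetical, purely for illustration): without UNMAP the backing store never sees frees, so chargeable usage is effectively the pool's high-water mark, while with UNMAP it tracks live data.

```python
# Toy model: billed bytes with and without UNMAP visibility of frees.

def billed_bytes(events, unmap_supported):
    """events: list of (op, nbytes) with op in {"write", "free"}."""
    live = high_water = 0
    for op, nbytes in events:
        if op == "write":
            live += nbytes
        elif op == "free" and unmap_supported:
            live -= nbytes
        # without UNMAP, a "free" never reaches the backing store
        high_water = max(high_water, live)
    return live if unmap_supported else high_water

# Hypothetical usage pattern: peak of 100 GB, steady state of 20 GB.
GB = 2**30
events = [("write", 100 * GB), ("free", 80 * GB)]
print(billed_bytes(events, unmap_supported=True) // GB)   # 20
print(billed_bytes(events, unmap_supported=False) // GB)  # 100
```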


Casper



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Sašo Kiselkov
On 01/22/2013 05:00 PM, casper@oracle.com wrote:
 Some vendors call this (and thins like it) Thin Provisioning, I'd say 
 it is more accurate communication between 'disk' and filesystem about 
 in use blocks.
 
 In some cases, users of disks are charged by bytes in use; when not using
 SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
 the whole reservation; this becomes costly when your standard usage is 
 much less than your peak usage.
 
 Thin provisioning can now be used for zpools as long as the underlying 
 LUNs have support for SCSI UNMAP

Looks like an interesting technical solution to a political problem :D

Cheers,
--
Saso



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Darren J Moffat



On 01/22/13 16:02, Sašo Kiselkov wrote:

On 01/22/2013 05:00 PM, casper@oracle.com wrote:

Some vendors call this (and thins like it) Thin Provisioning, I'd say
it is more accurate communication between 'disk' and filesystem about
in use blocks.


In some cases, users of disks are charged by bytes in use; when not using
SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
the whole reservation; this becomes costly when your standard usage is
much less than your peak usage.

Thin provisioning can now be used for zpools as long as the underlying
LUNs have support for SCSI UNMAP


Looks like an interesting technical solution to a political problem :D


There is also a technical problem: if you can't inform the 
backing store that you no longer need the blocks, it can't free them 
either, so they get stuck in snapshots unnecessarily.


--
Darren J Moffat


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Sašo Kiselkov
On 01/22/2013 05:34 PM, Darren J Moffat wrote:
 
 
 On 01/22/13 16:02, Sašo Kiselkov wrote:
 On 01/22/2013 05:00 PM, casper@oracle.com wrote:
 Some vendors call this (and thins like it) Thin Provisioning, I'd say
 it is more accurate communication between 'disk' and filesystem about
 in use blocks.

 In some cases, users of disks are charged by bytes in use; when not
 using
 SCSI UNMAP, a set of disks used for a zpool will in the end be
 charged for
 the whole reservation; this becomes costly when your standard usage is
 much less than your peak usage.

 Thin provisioning can now be used for zpools as long as the underlying
 LUNs have support for SCSI UNMAP

 Looks like an interesting technical solution to a political problem :D
 
 There is also a technical problem too: because if you can't inform the
 backing store that you no longer need the blocks it can't free them
 either so they get stuck in snapshots unnecessarily.

Yes, I understand the technical merit of the solution. I'm just amused
that a noticeable side-effect is lower licensing costs (by that I don't
of course mean that the issue is unimportant, just that I find it
interesting what the world has come to) - I'm not trying to ridicule.

Cheers,
--
Saso


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Jim Klimov

On 2013-01-22 14:29, Darren J Moffat wrote:

Preallocated ZVOLs - for swap/dump.


Sounds like something I proposed on these lists, too ;)
Does this preallocation only mean filling an otherwise ordinary
ZVOL with zeroes (or some other pattern) - if so, to what effect?

Or is it also supported to disable COW for such datasets, so that
the preallocated swap/dump zvols might remain contiguous on the
faster tracks of the drive (i.e. like a dedicated partition, but
with benefits of ZFS checksums and maybe compression)?

Thanks,
//Jim


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Ian Collins

Darren J Moffat wrote:

It is a mechanism for part of the storage system above the disk (eg
ZFS) to inform the disk that it is no longer using a given set of blocks.

This is useful when using an SSD - see Saso's excellent response on that.

However it can also be very useful when your disk is an iSCSI LUN.  It
allows the filesystem layer (eg ZFS or NTFS, etc) when on iSCSI LUN that
advertises SCSI UNMAP to tell the target there are blocks in that LUN it
isn't using any more (eg it just deleted some blocks).


That is something I have been waiting a long time for!  I have to run a 
periodic "fill the pool with zeros" cycle on a couple of iSCSI-backed 
pools to reclaim free space.


I guess the big question is do oracle storage appliances advertise SCSI 
UNMAP?


--
Ian.



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Sašo Kiselkov
On 01/22/2013 10:45 PM, Jim Klimov wrote:
 On 2013-01-22 14:29, Darren J Moffat wrote:
 Preallocated ZVOLs - for swap/dump.
 
 Or is it also supported to disable COW for such datasets, so that
 the preallocated swap/dump zvols might remain contiguous on the
 faster tracks of the drive (i.e. like a dedicated partition, but
 with benefits of ZFS checksums and maybe compression)?

I highly doubt it, as it breaks one of the fundamental design principles
behind ZFS (always maintain transactional consistency). Also,
contiguousness and compression are fundamentally at odds: contiguousness
requires each block to remain the same length regardless of contents,
while compression varies block length depending on the entropy of the
contents.
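The entropy dependence is easy to demonstrate. A quick sketch (using zlib merely as a stand-in for whatever compressor the pool uses, e.g. lz4): a 128 KiB block of zeros shrinks to almost nothing, while a 128 KiB block of random bytes doesn't shrink at all.

```python
# Compressed block length depends on content entropy: highly redundant
# data compresses dramatically, random data not at all.
import os
import zlib

BLOCK = 128 * 1024
zeros = bytes(BLOCK)        # maximally redundant block
noise = os.urandom(BLOCK)   # maximally entropic block

len_zeros = len(zlib.compress(zeros))
len_noise = len(zlib.compress(noise))
print(len_zeros, len_noise)  # zeros: a few hundred bytes; noise: ~BLOCK
```

This is exactly why compressed blocks cannot keep fixed on-disk positions: their physical length changes with their contents.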

Cheers,
--
Saso


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Jim Klimov

On 2013-01-22 23:03, Sašo Kiselkov wrote:

On 01/22/2013 10:45 PM, Jim Klimov wrote:

On 2013-01-22 14:29, Darren J Moffat wrote:

Preallocated ZVOLs - for swap/dump.


Or is it also supported to disable COW for such datasets, so that
the preallocated swap/dump zvols might remain contiguous on the
faster tracks of the drive (i.e. like a dedicated partition, but
with benefits of ZFS checksums and maybe compression)?


I highly doubt it, as it breaks one of the fundamental design principles
behind ZFS (always maintain transactional consistency). Also,
contiguousness and compression are fundamentally at odds (contiguousness
requires each block to remain the same length regardless of contents,
compression varies block length depending on the entropy of the contents).


Well, dump and swap devices are kind of special in that they need
verifiable storage (i.e. detectable to have no bit-errors) but not
really consistency as in sudden-power-off transaction protection.
Both have a lifetime span of a single system uptime - like L2ARC,
for example - and will be reused anew afterwards - after a reboot,
a power-surge, or a kernel panic.

So while metadata used to address the swap ZVOL contents may and
should be subject to common ZFS transactions and COW and so on,
and jump around the disk along with rewrites of blocks, the ZVOL
userdata itself may as well occupy the same positions on the disk,
I think, rewriting older stuff. With mirroring likely in place as
well as checksums, there are other ways than COW to ensure that
the swap (at least some component thereof) contains what it should,
even with intermittent errors of some component devices.

Likewise, swap/dump breed of zvols shouldn't really have snapshots,
especially not automatic ones (and the installer should take care
of this at least for the two zvols it creates) ;)

Compression for swap is an interesting matter... for example, how
should it be accounted? As dynamic expansion and/or shrinking of
available swap space (or just of space needed to store it)?

If the latter, and we still intend to preallocate and guarantee
that the swap has its administratively predefined amount of
gigabytes, compressed blocks can be aligned on those starting
locations as if they were not compressed. In effect this would
just decrease the bandwidth requirements, maybe.

For dump this might be just a bulky compressed write from start
to however much it needs, within the preallocated psize limits...

//Jim



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Nico Williams
IIRC dump is special.

As for swap... really, you don't want to swap.  If you're swapping you
have problems.  Any swap space you have is to help you detect those
problems and correct them before apps start getting ENOMEM.  There
*are* exceptions to this, such as Varnish.  For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no
nothing.

Nico
--


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Jim Klimov

On 2013-01-22 23:32, Nico Williams wrote:

IIRC dump is special.

As for swap... really, you don't want to swap.  If you're swapping you
have problems.  Any swap space you have is to help you detect those
problems and correct them before apps start getting ENOMEM.  There
*are* exceptions to this, such as Varnish.  For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no
nothing.


I know of this stance, and in general you're right. But... ;)

Sometimes, there are once-in-a-longtime tasks that might require
enormous virtual memory that you wouldn't normally provision
proper hardware for (RAM, SSD) and/or cases when you have to run
similarly greedy tasks on hardware with limited specs (i.e. home
PC capped at 8GB RAM). As an example I might think of a ZDB walk
taking about 35-40GB VM on my box. This is not something I do
every month, but when I do, I need it to complete even though
I have 5 times less RAM on that box (and the kernel's equivalent
of that walk fails with scanrate hell because it can't swap, btw).

On another hand, there are tasks like VirtualBox which require
swap to be configured in amounts equivalent to VM RAM size, but
don't really swap (most of the time). Setting aside SSDs for this
task might be too expensive, if they are never to be used in real
practice.

But this point is more of a task for swap device tiering (like
with Linux swap priorities), as I proposed earlier last year...

//Jim



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Sašo Kiselkov
On 01/22/2013 11:22 PM, Jim Klimov wrote:
 On 2013-01-22 23:03, Sašo Kiselkov wrote:
 On 01/22/2013 10:45 PM, Jim Klimov wrote:
 On 2013-01-22 14:29, Darren J Moffat wrote:
 Preallocated ZVOLs - for swap/dump.

 Or is it also supported to disable COW for such datasets, so that
 the preallocated swap/dump zvols might remain contiguous on the
 faster tracks of the drive (i.e. like a dedicated partition, but
 with benefits of ZFS checksums and maybe compression)?

 I highly doubt it, as it breaks one of the fundamental design principles
 behind ZFS (always maintain transactional consistency). Also,
 contiguousness and compression are fundamentally at odds (contiguousness
 requires each block to remain the same length regardless of contents,
 compression varies block length depending on the entropy of the
 contents).
 
 Well, dump and swap devices are kind of special in that they need
 verifiable storage (i.e. detectable to have no bit-errors) but not
 really consistency as in sudden-power-off transaction protection.

I get your point, but I would argue that if you are willing to
preallocate storage for these, then putting dump/swap on an iSCSI LUN as
opposed to having it locally is kind of pointless anyway. Since they are
used rarely, having them thin provisioned is probably better in an
iSCSI environment than wasting valuable network-storage resources on
something you rarely need.

 Both have a lifetime span of a single system uptime - like L2ARC,
 for example - and will be reused anew afterwards - after a reboot,
 a power-surge, or a kernel panic.

For the record, the L2ARC is not transactionally consistent. It uses a
completely different allocation strategy from the main pool (essentially
a simple rotor). Besides, if you plan to shred your dump contents after
reboot anyway, why fat-provision them? I can understand swap, but dump?

 So while metadata used to address the swap ZVOL contents may and
 should be subject to common ZFS transactions and COW and so on,
 and jump around the disk along with rewrites of blocks, the ZVOL
 userdata itself may as well occupy the same positions on the disk,
 I think, rewriting older stuff. With mirroring likely in place as
 well as checksums, there are other ways than COW to ensure that
 the swap (at least some component thereof) contains what it should,
 even with intermittent errors of some component devices.

You don't understand, the transactional integrity in ZFS isn't just to
protect the data you put in, it's also meant to protect ZFS' internal
structure (i.e. the metadata). This includes the layout of your zvols
(which are also just another dataset). I understand that you want to
view this kind of fat-provisioned zvol as a simple contiguous
container block, but it is probably more hassle to implement than it's
worth.

 Likewise, swap/dump breed of zvols shouldn't really have snapshots,
 especially not automatic ones (and the installer should take care
 of this at least for the two zvols it creates) ;)

If you are talking about the standard opensolaris-style
boot-environments, then yes, this is taken into account. Your BE lives
under rpool/ROOT, while swap and dump are rpool/swap and rpool/dump
respectively (both thin-provisioned, since they are rarely needed).

 Compression for swap is an interesting matter... for example, how
 should it be accounted? As dynamic expansion and/or shrinking of
 available swap space (or just of space needed to store it)?

Since compression occurs way below the dataset layer, your zvol capacity
doesn't change with compression, even though how much space it actually
uses in the pool can. A zvol's capacity pertains to its logical
attributes, i.e. most importantly the maximum byte offset within it
accessible to an application (in this case, swap). How the underlying
blocks are actually stored and how much space they take up is up to the
lower layers.

 If the latter, and we still intend to preallocate and guarantee
 that the swap has its administratively predefined amount of
 gigabytes, compressed blocks can be aligned on those starting
 locations as if they were not compressed. In effect this would
 just decrease the bandwidth requirements, maybe.

But you forget that a compressed block's physical size fundamentally
depends on its contents. That's why compressed zvols still appear the
same size as before. What changes is how much space they occupy on the
underlying pool.

 For dump this might be just a bulky compressed write from start
 to however much it needs, within the preallocated psize limits...

I hope you now understand the distinction between the logical size of a
zvol and its actual in-pool size. We can't tie one to the other, since it
would result in unpredictable behavior for the application (write one
set of data, get capacity X, write another set, get capacity Y - how to
determine in advance how much fits in? You can't).
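The logical-vs-physical distinction has a familiar filesystem analogy (a sketch, not ZFS-specific): a sparse file reports a fixed logical size (`st_size`) no matter how few blocks actually back it (`st_blocks`), much as a zvol's capacity stays fixed while its in-pool footprint varies. This assumes the temp filesystem supports sparse files, as most Unix filesystems do.

```python
# Sparse-file analogy: fixed logical size, variable physical allocation.
import os
import tempfile

LOGICAL = 10 * 1024 * 1024  # 10 MiB logical size

with tempfile.NamedTemporaryFile() as f:
    f.truncate(LOGICAL)      # set logical size without writing any data
    f.write(b"x" * 4096)     # back only the first 4 KiB with real data
    f.flush()
    st = os.stat(f.name)
    # st_blocks is in 512-byte units: allocation is far below st_size
    print(st.st_size, st.st_blocks * 512)
```

An application asking "how big is this file?" always gets the logical answer; how much pool/disk space it consumes is a separate, content-dependent question.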

Cheers,
--
Saso

Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Nico Williams
 
 As for swap... really, you don't want to swap.  If you're swapping you
 have problems.  

For clarification, the above is true in Solaris and derivatives, but it's not 
universally true for all OSes.  I'll cite linux as the example, because I know 
it.  If you provide swap to a linux kernel, it considers this a degree of 
freedom when choosing to evict data from the cache, versus swapping out idle 
processes (or zombie processes.)  As long as you swap out idle process memory 
that is colder than some cache memory, swap actually improves performance.  But 
of course, if you have any active process starved of ram and consequently 
thrashing swap actively, of course, you're right.  It's bad bad bad to use swap 
that way.
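On Linux, the cache-vs-swap tradeoff described above is tunable via the `vm.swappiness` sysctl (higher values bias the kernel toward swapping out idle process pages, lower values toward dropping page cache). A small helper to read it; it returns None on systems without the /proc interface:

```python
# Read Linux's vm.swappiness tunable, which biases the kernel's choice
# between evicting page cache and swapping out idle process memory.
def read_swappiness(path="/proc/sys/vm/swappiness"):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None  # non-Linux system, or /proc not mounted

value = read_swappiness()
print(value)  # typically 60 by default on Linux; None elsewhere
```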

In solaris, I've never seen it swap out idle processes; I've only seen it use 
swap for the bad bad bad situation.  I assume that's all it can do with swap.



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Jim Klimov

The discussion gets suddenly hot and interesting - albeit quite diverged
from the original topic ;)

First of all, as a disclaimer, when I have earlier proposed such changes
to datasets for swap (and maybe dump) use, I've explicitly proposed that
this be a new dataset type - compared to zvol and fs and snapshot that
we have today. Granted, this distinction was lost in today's exchange
of words, but it is still an important one - especially since it means
that while basic ZFS (or rather ZPOOL) rules are maintained, the dataset
rules might be redefined ;)

I'll try to reply to a few points below, snipping a lot of older text.

 Well, dump and swap devices are kind of special in that they need

verifiable storage (i.e. detectable to have no bit-errors) but not
really consistency as in sudden-power-off transaction protection.


I get your point, but I would argue that if you are willing to
preallocate storage for these, then putting dump/swap on an iSCSI LUN as
opposed to having it locally is kind of pointless anyway. Since they are
used rarely, having them thin provisioned is probably better in a
iSCSI environment than wasting valuable network-storage resources on
something you rarely need.


I am not sure what in my post led you to think that I meant iSCSI
or otherwise networked storage to keep swap and dump. Some servers
have local disks, you know - and in networked storage environments
the local disks are only used to keep the OS image, swap and dump ;)


Besides, if you plan to shred your dump contents after
reboot anyway, why fat-provision them? I can understand swap, but dump?


Guarantee that the space is there... Given the recent mischiefs
with dumping (i.e. the context is quite stripped compared to the
general kernel work, so multithreading broke somehow) I guess that
pre-provisioned sequential areas might also reduce some risks...
though likely not - random metadata would still have to get into
the pool.


You don't understand, the transactional integrity in ZFS isn't just to
protect the data you put in, it's also meant to protect ZFS' internal
structure (i.e. the metadata). This includes the layout of your zvols
(which are also just another dataset). I understand that you want to
view a this kind of fat-provisioned zvol as a simple contiguous
container block, but it is probably more hassle to implement than it's
worth.


I'd argue that transactional integrity in ZFS primarily protects
metadata, so that there is a tree of always-actual block pointers.
There is this octopus of a block-pointer tree whose leaf nodes
point to data blocks - but only as DVAs and checksums, basically.
Nothing really requires data to be or not be COWed and stored at
a different location than the previous version of the block at
the same logical offset for the data consumers (FS users, zvol
users), except that we want that data to be readable even after
a catastrophic pool close (system crash, poweroff, etc.).

We don't (AFAIK) have such a requirement for swap. If the pool
which contained swap kicked the bucket, we probably have a
larger problem whose solution will likely involve reboot and thus
recycling of all swap data.

And for single-device errors with (contiguous) preallocated
unrelocatable swap, we can protect with mirrors and checksums
(used upon read, within this same uptime that wrote the bits).




Likewise, swap/dump breed of zvols shouldn't really have snapshots,
especially not automatic ones (and the installer should take care
of this at least for the two zvols it creates) ;)


If you are talking about the standard opensolaris-style
boot-environments, then yes, this is taken into account. Your BE lives
under rpool/ROOT, while swap and dump are rpool/swap and rpool/dump
respectively (both thin-provisioned, since they are rarely needed).


I meant the attribute for zfs-auto-snapshots service, i.e.:
rpool/swap  com.sun:auto-snapshot  false  local

As I wrote, I'd argue that for new swap (and maybe dump) datasets
the snapshot action should not even be implemented.




Compression for swap is an interesting matter... for example, how
should it be accounted? As dynamic expansion and/or shrinking of
available swap space (or just of space needed to store it)?


Since compression occurs way below the dataset layer, your zvol capacity
doesn't change with compression, even though how much space it actually
uses in the pool can. A zvol's capacity pertains to its logical
attributes, i.e. most importantly the maximum byte offset within it
accessible to an application (in this case, swap). How the underlying
blocks are actually stored and how much space they take up is up to the
lower layers.

...

But you forget that a compressed block's physical size fundamentally
depends on its contents. That's why compressed zvols still appear the
same size as before. What changes is how much space they occupy on the
underlying pool.


I won't argue with this, as it is perfectly correct for zvols and
undefined for the 

Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Gary Mills
On Tue, Jan 22, 2013 at 11:54:53PM +, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Nico Williams
  
  As for swap... really, you don't want to swap.  If you're swapping you
  have problems.  
 
 In solaris, I've never seen it swap out idle processes; I've only
 seen it use swap for the bad bad bad situation.  I assume that's all
 it can do with swap.

You would be wrong.  Solaris uses swap space for paging.  Paging out
unused portions of an executing process from real memory to the swap
device is certainly beneficial.  Swapping out complete processes is a
desperation move, but paging out most of an idle process is a good
thing.

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-21 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Richard Elling [mailto:richard.ell...@gmail.com]
 
 I disagree the ZFS is developmentally challenged. 

As an IT consultant 8 years ago, before I had heard of ZFS, it was always easy to 
sell Ontap, as long as it fit into the budget.  5 years ago, whenever I told 
customers about ZFS, it was always a quick, easy sell.  Nowadays, anybody who's 
heard of it says they don't want it, because they believe it's a dying product, 
and they're putting their bets on Linux instead.  I try to convince them 
otherwise, but I'm trying to buck the word on the street.  They don't listen, 
however much sense I make.  I can only sell ZFS nowadays to customers who have 
still never heard of it.

Developmentally challenged doesn't mean there is no development taking place. 
 It means the largest development effort is working closed-source, and not 
available for free (except for some purposes), so some consumers are going to 
follow that path, while others are going to follow the open-source illumos 
path, which means disunity amongst developers, disunity amongst consumers, and 
incompatibility amongst products.  So far, in the illumos branch, I've only 
seen bugfixes introduced since zpool 28, no significant introduction of new 
features.  (Unlike the Oracle branch, which is just as easy to sell as Ontap.)

Which presents a challenge.  Hence the term, challenged.

Right now, ZFS is the leading product as far as I'm concerned.  Better than MS 
VSS, better than Ontap, better than BTRFS.  It is my personal opinion that one 
day BTRFS will eclipse ZFS, due to Oracle's unsupportive strategy causing 
disparity and lowering consumer demand for ZFS, but of course, that's just a 
personal prediction, which has yet to be borne out.  So far, every time I 
evaluate BTRFS, it fails spectacularly, but the last time I did was about a 
year ago.  I'm due for a BTRFS re-evaluation now.



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-21 Thread Dan Swartzendruber

Zfs on linux (ZOL) has made some pretty impressive strides over the last
year or so...



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-21 Thread Sašo Kiselkov
On 01/21/2013 02:28 PM, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) wrote:
 From: Richard Elling [mailto:richard.ell...@gmail.com]

 I disagree the ZFS is developmentally challenged. 
 
 As an IT consultant, 8 years ago before I heard of ZFS, it was always easy
 to sell Ontap, as long as it fit into the budget.  5 years ago, whenever I
 told customers about ZFS, it was always a quick easy sell.  Nowadays,
 anybody who's heard of it says they don't want it, because they believe
 it's a dying product, and they're putting their bets on linux instead. I
 try to convince them otherwise, but I'm trying to buck the word on the street.
 They don't listen, however much sense I make. I can only sell ZFS to
 customers nowadays, who have still never heard of it.

Yes, Oracle did some serious damage to ZFS' and its own reputation. My
former employer used to be an almost exclusive Sun-shop. The moment
Oracle took over and decided to tank the products aimed at our segment,
we waved our beloved Sun hardware goodbye. Larry has clearly delineated
his marketing strategy: either you're a Fortune500, or you can fuck
right off.

 Developmentally challenged doesn't mean there is no development taking 
 place.
 It means the largest development effort is working closed-source, and not
 available for free (except some purposes), so some consumers are going to
 follow their path,

I would contest that point. Besides encryption (which I think was
already well underway by the time Oracle took over), AFAIK nothing much
improved in Oracle ZFS. Oracle only considers Sun a vehicle to sell its
software products on (DB, ERP, CRM, etc.). Anything that doesn't fit
into that strategy (e.g. Thumper) got butchered and thrown to the side.

 while others are going to follow the open source branch illumos path, which
 means both disunity amongst developers and disunity amongst consumers, and
 incompatibility amongst products.

I can't talk about disunity among devs (how would that manifest
itself?), but as far as incompatibility among products, I've yet to come
across it. In fact, thanks to ZFS feature flags, different feature sets
can coexist peacefully and give admins unprecedented control over their
storage pools. Version control in ZFS used to be a take-it-or-leave-it
approach; now you can selectively enable and use only the features you want.

 So far, in the illumos branch, I've only seen bugfixes introduced since
 zpool 28, no significant introduction of new features.

I've had #3035 LZ4 compression for ZFS and GRUB integrated just a few
days back and I've got #3137 L2ARC compression up for review as we
speak. Waiting for #3137 to integrate, I'm looking to focus on multi-MB
record sizes next, and then perhaps taking a long hard look at reducing
the in-memory DDT footprint.

 (Unlike the oracle branch, which is just as easy to sell as ontap).

Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.

 Which presents a challenge.  Hence the term, challenged.

Agreed, it is a challenge and needs to be taken seriously. We are up
against a lot of money and man-hours invested by big-name companies, so
I fully agree there. We need to rally ourselves as a community and hold
together tightly.

 Right now, ZFS is the leading product as far as I'm concerned.  Better
 than MS VSS, better than Ontap, better than BTRFS.  It is my personal
 opinion that one day BTRFS will eclipse ZFS due to oracle's unsupportive
 strategy causing disparity and lowering consumer demand for zfs, but of
 course, that's just a personal opinion prediction for the future, which
 has yet to be seen.  So far, every time I evaluate BTRFS, it fails
 spectacularly, but the last time I did, was about a year ago.  I'm due
 for a BTRFS re-evaluation now.

Let us know at z...@lists.illumos.org how that goes, perhaps write a blog
post about your observations. I'm sure the BTRFS folks came up with some
neat ideas which we might learn from.

Cheers,
--
Saso


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-21 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
 
 as far as incompatibility among products, I've yet to come
 across it

I was talking about ... install solaris 11, and it's using a new version of zfs 
that's incompatible with anything else out there.  And vice-versa.  (Not sure 
if feature flags is the default, or zpool 28 is the default, in various 
illumos-based distributions.  But my understanding is that once you upgrade to 
feature flags, you can't go back to 28.  Which means anything beyond 28 is 
mutually incompatible with systems still at 28.)  You have to make a conscious decision 
and plan ahead, and intentionally go to zpool 28 and no higher, if you want 
compatibility between systems.


 Let us know at z...@lists.illumos.org how that goes, perhaps write a blog
 post about your observations. I'm sure the BTRFS folks came up with some
 neat ideas which we might learn from.

Actually - I've written about it before (but it'll be difficult to find, and 
nothing earth shattering, so not worth the search.)  I don't think there's 
anything that zfs developers don't already know.  Basic stuff like fsck, and 
ability to shrink and remove devices, those are the things btrfs has and zfs 
doesn't.  (But there's lots more stuff that zfs has and btrfs doesn't.  Just 
making sure my previous comment isn't seen as a criticism of zfs, or a 
judgement in favor of btrfs.)

And even with a new evaluation, the conclusion can't be completely clear, nor 
immediate.  Last evaluation started about 10 months ago, and we kept it in 
production for several weeks or a couple of months, because it appeared to be 
doing everything well.  (Except for features that were known to be not-yet 
implemented, such as read-only snapshots (aka quotas) and btrfs-equivalent of 
zfs send.)  Problem was, the system was unstable, crashing about once a week. 
 No clues why.  We tried all sorts of things in kernel, hardware, drivers, with 
and without support, to diagnose and capture the cause of the crashes.  Then 
one day, I took a blind stab in the dark (for the ninetieth time) and I 
reformatted the storage volume ext4 instead of btrfs.  After that, no more 
crashes.  That was approx 8 months ago.

I think the only thing I could learn upon a new evaluation is:  #1  I hear 
btrfs send is implemented now.  I'd like to see it with my own eyes before I 
believe it.  #2  I hear quotas (read-only snapshots) are implemented now.  
Again, I'd like to see it before I believe it.  #3  Proven stability.  Never 
seen it yet with btrfs.  Want to see it with my eyes and stand the test of time 
before it earns my trust.



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-21 Thread Sašo Kiselkov
On 01/22/2013 03:56 AM, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) wrote:
 From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]

 as far as incompatibility among products, I've yet to come
 across it
 
 I was talking about ... install solaris 11, and it's using a new version
 of zfs that's incompatible with anything else out there.  And vice-versa.

Wait, you're complaining about a closed-source vendor who did a
conscious effort to fuck the rest of the community over? I think you're
crying on the wrong shoulder - it wasn't the open ZFS community that
pulled this dick move. Yes, you can argue that the customer isn't
interested in politics, but unfortunately, there are some things that we
simply can't do anything about - the ball is in Oracle's court on this one.

  (Not sure if feature flags is the default, or zpool 28 is the default,
 in various illumos-based distributions.  But my understanding is that
 once you upgrade to feature flags, you can't go back to 28.  Which means,
 mutually, anything 28 is incompatible with each other.)  You have to
 typically make a conscious decision and plan ahead, and intentionally go
 to zpool 28 and no higher, if you want compatibility between systems.

Yes, feature flags is the default, simply because it is a way for open
ZFS vendors to interoperate. Oracle is an important player in ZFS for
sure, but we can't let their unwillingness to cooperate with others hold
the whole community in stasis - that is actually what they would have
wanted.

 Let us know at z...@lists.illumos.org how that goes, perhaps write a blog
 post about your observations. I'm sure the BTRFS folks came up with some
 neat ideas which we might learn from.
 
 Actually - I've written about it before (but it'll be difficult to find,
 and nothing earth shattering, so not worth the search.)  I don't think
 there's anything that zfs developers don't already know.  Basic stuff like
 fsck, and ability to shrink and remove devices, those are the things btrfs
 has and zfs doesn't.  (But there's lots more stuff that zfs has and btrfs
 doesn't.  Just making sure my previous comment isn't seen as a criticism
 of zfs, or a judgement in favor of btrfs.)

Well, I learned of the LZ4 compression algorithm in a benchmark
comparison of compression on ZFS, BTRFS and other filesystems. Seeing that
there were better things out there I decided to try and push the state
of ZFS compression ahead a little.

 And even with a new evaluation, the conclusion can't be completely clear,
 nor immediate.  Last evaluation started about 10 months ago, and we kept
 it in production for several weeks or a couple of months, because it
 appeared to be doing everything well.  (Except for features that were known
 to be not-yet implemented, such as read-only snapshots (aka quotas) and
 btrfs-equivalent of zfs send.)  Problem was, the system was unstable,
 crashing about once a week.  No clues why.  We tried all sorts of things
 in kernel, hardware, drivers, with and without support, to diagnose and
 capture the cause of the crashes.  Then one day, I took a blind stab in the
 dark (for the ninetieth time) and I reformatted the storage volume ext4
 instead of btrfs.  After that, no more crashes.  That was approx 8 months ago.

Even negative results are results. I'm sure the BTRFS devs would be
interested in your crash dumps. Not saying that you are in any way
obligated to provide them - just pointing out that perhaps you were
hitting some snag that could have been resolved (or not).

 I think the only thing I could learn upon a new evaluation is:  #1  I hear
 btrfs send is implemented now.  I'd like to see it with my own eyes before
 I believe it.  #2  I hear quotas (read-only snapshots) are implemented now.
 Again, I'd like to see it before I believe it.  #3  Proven stability.  Never
 seen it yet with btrfs.  Want to see it with my eyes and stand the test of
 time before it earns my trust.

Do not underestimate these guys. They could have come up with a cool new
feature that we haven't heard anything about at all. One of the things
knocking around in my head ever since it was mentioned a while back on
these mailing lists was a metadata-caching device, i.e. a small yet
super-fast device that would allow you to just store the pool
topology for very fast scrub/resilver. These are the sort of things that
I meant - they could have thought about filesystems in ways that haven't
been done widely before. While BTRFS may be developmentally behind ZFS,
one still has to have great respect for the intellect of its developers
- these guys are not dumb.

Cheers,
--
Saso


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Nico Williams
 
 I've wanted a system where dedup applies only to blocks being written
 that have a good chance of being dups of others.
 
 I think one way to do this would be to keep a scalable Bloom filter
 (on disk) into which one inserts block hashes.
 
 To decide if a block needs dedup one would first check the Bloom
 filter, then if the block is in it, use the dedup code path, 

How is this different or better than the existing dedup architecture?  If you 
found that some block about to be written in fact matches the hash of an 
existing block on disk, then you've already determined it's a duplicate block, 
exactly as you would, if you had dedup enabled.  In that situation, gosh, it 
sure would be nice to have the extra information like reference count, and 
pointer to the duplicate block, which exists in the dedup table.  

In other words, exactly the way existing dedup is already architected.


 The nice thing about this is that Bloom filters can be sized to fit in
 main memory, and will be much smaller than the DDT.

If you're storing all the hashes of all the blocks, how is that going to be 
smaller than the DDT storing all the hashes of all the blocks?



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Edward Harvey
So ... The way things presently are, ideally you would know in advance what 
stuff you were planning to write that has duplicate copies.  You could enable 
dedup, then write all the stuff that's highly duplicated, then turn off dedup 
and write all the non-duplicate stuff.  Obviously, however, this is a fairly 
implausible actual scenario.

In reality, while you're writing, you're going to have duplicate blocks mixed 
in with your non-duplicate blocks, which fundamentally means the system needs 
to be calculating the cksums and entering them into the DDT, even for the unique 
blocks...  Just because the first time the system sees each duplicate block, it 
doesn't yet know that it's going to be duplicated later.

But as you said, after data is written, and sits around for a while, the 
probability of duplicating unique blocks diminishes over time.  So they're just 
a burden.

I would think, the ideal situation would be to take your idea of un-dedup for 
unique blocks, and take it a step further.  Un-dedup unique blocks that are 
older than some configurable threshold.  Maybe you could have a command for a 
sysadmin to run, to scan the whole pool performing this operation, but it's the 
kind of maintenance that really should be done upon access, too.  Somebody goes 
back and reads a jpg from last year, system reads it and consequently loads the 
DDT entry, discovers that it's unique and has been for a long time, so throw 
out the DDT info.
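The pruning rule described above can be sketched as a toy model. Everything here is hypothetical: the in-memory dict standing in for the DDT, the `(refcount, write_time)` entry layout, and the threshold value are all invented for illustration, not real ZFS structures.

```python
import time

# Hypothetical model of the policy above: the "DDT" maps a block
# checksum -> (refcount, time the block was written).  Nothing here is
# actual ZFS code; it only illustrates the eviction rule.

AGE_THRESHOLD = 365 * 24 * 3600  # configurable; one year, in seconds

def prune_on_access(ddt: dict, checksum: bytes, now=None) -> bool:
    """On a read, drop the DDT entry if the block is still unique
    (refcount == 1) and older than the threshold.  Returns True if
    the entry was evicted; the block then reverts to a plain,
    non-deduped block."""
    now = time.time() if now is None else now
    entry = ddt.get(checksum)
    if entry is None:
        return False
    refcount, written = entry
    if refcount == 1 and now - written > AGE_THRESHOLD:
        del ddt[checksum]
        return True
    return False
```

The same check could run in a background scrub-like pass, or lazily on access as in the year-old jpg example above.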

But, by talking about it, we're just smoking pipe dreams.  Cuz we all know zfs 
is developmentally challenged now.  But one can dream...

finglonger




Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Nico Williams
Bloom filters are very small, that's the difference.  You might only need a
few bits per block for a Bloom filter.  Compare to the size of a DDT entry.
 A Bloom filter could be cached entirely in main memory.
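The "few bits per block" claim can be sanity-checked with the standard Bloom filter sizing formula, m/n = -ln(p) / (ln 2)^2 bits per inserted element for false-positive rate p. The ~320 bytes per in-core DDT entry used for comparison below is an approximate figure often quoted on this list, not an exact constant:

```python
import math

def bloom_bits_per_entry(p: float) -> float:
    """Bits per inserted element for a Bloom filter with
    false-positive rate p (optimal sizing: m/n = -ln p / (ln 2)^2)."""
    return -math.log(p) / (math.log(2) ** 2)

DDT_ENTRY_BYTES = 320  # rough, commonly quoted in-core figure

for p in (0.01, 0.001):
    bits = bloom_bits_per_entry(p)
    print(f"p={p}: {bits:.1f} bits (~{bits / 8:.2f} bytes) per block, "
          f"vs ~{DDT_ENTRY_BYTES} bytes per DDT entry")
```

At a 1% false-positive rate this works out to roughly 10 bits per block, a couple of orders of magnitude smaller than a DDT entry, which is why the whole filter can plausibly live in main memory.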


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Nico Williams
 
 To decide if a block needs dedup one would first check the Bloom
 filter, then if the block is in it, use the dedup code path, else the
 non-dedup codepath and insert the block in the Bloom filter.  

Sorry, I didn't know what a Bloom filter was before my earlier reply - now I've 
read the wikipedia article and am consequently an expert.   *sic*   ;-)

It sounds like, what you're describing...  The first time some data gets 
written, it will not produce a hit in the Bloom filter, so it will get written 
to disk without dedup.  But now it has an entry in the Bloom filter.  So the 
second time the data block gets written (the first duplicate) it will produce a 
hit in the Bloom filter, and consequently get a dedup DDT entry.  But since the 
system didn't dedup the first one, it means the second one still needs to be 
written to disk independently of the first one.  So in effect, you'll always 
miss the first duplicated block write, but you'll successfully dedup n-1 
duplicated blocks.  Which is entirely reasonable, although not strictly 
optimal.  And sometimes you'll get a false positive out of the Bloom filter, so 
sometimes you'll be running the dedup code on blocks which are actually unique, 
but with some intelligently selected parameters such as Bloom table size, you 
can get this probability to be reasonably small, like less than 1%.

In the wikipedia article, they say you can't remove an entry from the Bloom 
filter table, which would over time cause consistent increase of false positive 
probability (approaching 100% false positives) from the Bloom filter and 
consequently high probability of dedup'ing blocks that are actually unique; but 
with even a minimal amount of thinking about it, I'm quite sure that's a 
solvable implementation detail.  Instead of storing a single bit for each entry 
in the table, store a counter.  Every time you create a new entry in the table, 
increment the different locations; every time you remove an entry from the 
table, decrement.  Obviously a counter requires more bits than a bit, but it's 
a linear increase of size, exponential increase of utility, and within the 
implementation limits of available hardware.  But there may be a more 
intelligent way of accomplishing the same goal.  (Like I said, I've only 
thought about this minimally).
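The counter-per-slot idea above is a standard construction (a counting Bloom filter) and can be sketched generically; this is an illustration only, not ZFS code, and the table size and hash count are arbitrary choices:

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter with per-slot counters instead of single bits,
    so entries can be removed without the false-positive creep
    described above."""

    def __init__(self, size: int = 1 << 20, nhashes: int = 4):
        self.size = size
        self.nhashes = nhashes
        self.counters = [0] * size

    def _slots(self, key: bytes):
        # Derive nhashes slot indices from one SHA-256 digest
        # (32 bytes = four independent 8-byte chunks).
        digest = hashlib.sha256(key).digest()
        for i in range(self.nhashes):
            chunk = digest[i * 8:(i + 1) * 8]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, key: bytes) -> None:
        for s in self._slots(key):
            self.counters[s] += 1

    def remove(self, key: bytes) -> None:
        # Only valid for keys that were previously added.
        for s in self._slots(key):
            self.counters[s] -= 1

    def might_contain(self, key: bytes) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.counters[s] > 0 for s in self._slots(key))
```

In the write-path scheme sketched earlier, the first write of a block misses the filter and takes the non-dedup path (then gets added), while the second and later writes hit the filter and take the dedup path.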

Meh, well.  Thanks for the interesting thought.  For whatever it's worth.



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Tomas Forsman
On 19 January, 2013 - Jim Klimov sent me these 2,0K bytes:

 Hello all,

   While revising my home NAS which had dedup enabled before I gathered
 that its RAM capacity was too puny for the task, I found that there is
 some deduplication among the data bits I uploaded there (makes sense,
 since it holds backups of many of the computers I've worked on - some
 of my homedirs' contents were bound to intersect). However, a lot of
 the blocks are in fact unique - have entries in the DDT with count=1
 and the blkptr_t bit set. In fact they are not deduped, and with my
 pouring of backups complete - they are unlikely to ever become deduped.

Another RFE would be 'zfs dedup mypool/somefs' and basically go through
and do a one-shot dedup. Would be useful in various scenarios. Possibly
go through the entire pool at once, to make dedup work across datasets (like
the real thing).

/Tomas
-- 
Tomas Forsman, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Jim Klimov

On 2013-01-20 19:55, Tomas Forsman wrote:

On 19 January, 2013 - Jim Klimov sent me these 2,0K bytes:


Hello all,

   While revising my home NAS which had dedup enabled before I gathered
that its RAM capacity was too puny for the task, I found that there is
some deduplication among the data bits I uploaded there (makes sense,
since it holds backups of many of the computers I've worked on - some
of my homedirs' contents were bound to intersect). However, a lot of
the blocks are in fact unique - have entries in the DDT with count=1
and the blkptr_t bit set. In fact they are not deduped, and with my
pouring of backups complete - they are unlikely to ever become deduped.


Another RFE would be 'zfs dedup mypool/somefs' and basically go through
and do a one-shot dedup. Would be useful in various scenarios. Possibly
go through the entire pool at once, to make dedup work across datasets (like
the real thing).


Yes, but that was asked before =)

Actually, the pool's metadata does contain all the needed bits (i.e.
checksum and size of blocks), so a scrub-like procedure could try to
find identical blocks among the unique ones (perhaps filtered to blocks
referenced from a dataset that currently wants dedup), throw one copy
out and add a DDT entry for the other.

On 2013-01-20 17:16, Edward Harvey wrote:
 So ... The way things presently are, ideally you would know in
 advance what stuff you were planning to write that has duplicate
 copies.  You could enable dedup, then write all the stuff that's
 highly duplicated, then turn off dedup and write all the
 non-duplicate stuff.  Obviously, however, this is a fairly
 implausible actual scenario.

Well, I guess I could script a solution that uses ZDB to dump the
blockpointer tree (about 100Gb of text on my system), and some
perl or sort/uniq/grep parsing over this huge text to find blocks
that are the same but not deduped - as well as those single-copy
deduped ones, and toggle the dedup property while rewriting the
block inside its parent file with DD.

This would all be within current ZFS's capabilities and ultimately
reach the goals of deduping pre-existing data as well as dropping
unique blocks from the DDT. It would certainly not be a real-time
solution (likely might take months on my box - just fetching the
BP tree took a couple of days) and would require more resources
than needed otherwise (rewrites of same userdata, storing and
parsing of addresses as text instead of binaries, etc.)

But I do see how this is doable even today even by a non-expert ;)
(Not sure I'd ever get around to actually doing this thus, though -
it is not a very clean solution nor a performant one).
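The grouping step of that script could be sketched as below. Note the input record layout `(dataset, object, offset, checksum, size)` is invented for illustration; the real zdb blockpointer dump would need its own parsing, and full-checksum matches should still be verified byte-for-byte before merging, as dedup=verify would:

```python
from collections import defaultdict

def find_dedup_candidates(records):
    """Group blocks by (checksum, size); any group with more than one
    member is a set of identical-looking blocks that are candidates
    for rewriting through a dedup-enabled dataset.  Each record is a
    hypothetical (dataset, object, offset, checksum, size) tuple
    already parsed out of a zdb blockpointer dump."""
    groups = defaultdict(list)
    for dataset, obj, offset, checksum, size in records:
        groups[(checksum, size)].append((dataset, obj, offset))
    # Keep only groups with duplicates (and, symmetrically, the
    # singleton groups are the count=1 DDT entries worth dropping).
    return {k: v for k, v in groups.items() if len(v) > 1}
```

The same pass answers the "which files intersect" question below: map each grouped (dataset, object) back to a path and you have the list of files sharing blocks.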

As a bonus, however, this ZDB dump would also provide an answer
to a frequently-asked question: which files on my system intersect
or are the same - and have some/all blocks in common via dedup?
Knowledge of this answer might help admins with some policy
decisions, be it a witch-hunt for hoarders of duplicate files or some
pattern-making to determine which datasets should keep dedup=on...

My few cents,
//Jim


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Richard Elling
On Jan 20, 2013, at 8:16 AM, Edward Harvey imaginat...@nedharvey.com wrote:
 But, by talking about it, we're just smoking pipe dreams.  Cuz we all know 
 zfs is developmentally challenged now.  But one can dream...

I disagree that ZFS is developmentally challenged. There is more development
now than ever in every way: # of developers, companies, OSes, KLOCs, features.
Perhaps the level of maturity makes progress appear to be moving more slowly
than it did early in its life?

 -- richard



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Jim Klimov

On 2013-01-20 17:16, Edward Harvey wrote:

But, by talking about it, we're just smoking pipe dreams.  Cuz we all know zfs 
is developmentally challenged now.  But one can dream...


I beg to disagree. Most of my contribution so far has been about
learning stuff and sharing it with others, as well as planting some
new ideas and (hopefully seen as constructive) doubting of others' -
including the implementation we have now - and I have yet to see
someone pick up my ideas and turn them into code (or prove why they
are rubbish). But overall I can't say that development has stagnated,
by any metric of stagnation or activity.

Yes, maybe there were more cool new things per year popping up
with Sun's concentrated engineering talent and financing, but now
it seems that most players - wherever they work now - took a pause
from the marathon, to refine what was done in the decade before.
And this is just as important as churning out innovations faster
than people can comprehend or audit or use them.

As a loud example of present active development - take the LZ4
quests completed by Saso recently. From what I gather, this is a
single man's job done on-line in the view of fellow list members
over a few months, almost like a reality-show; and I guess anyone
with enough concentration, time and devotion could do likewise.

I suspect many of my proposals to the list might also take some
half of a man-year to complete. Unfortunately for the community
and for part of myself, I now have some higher daily priorities
so that I likely won't sit down and code lots of stuff in the
nearest years (until that Priority goes to school, or so). Maybe
that's why I'm eager to suggest quests for brilliant coders here
who can complete the job better and faster than I ever would ;)
So I'm doing the next best things I can do to help the progress :)

And I don't believe this is in vain, that the development ceased
and my writings are only destined to be stuffed under the carpet.
Be it these RFEs or some others, better and more useful, I believe
they shall be coded and published in common ZFS code. Sometime...

//Jim



Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Tim Cook
On Sun, Jan 20, 2013 at 6:19 PM, Richard Elling richard.ell...@gmail.comwrote:

 On Jan 20, 2013, at 8:16 AM, Edward Harvey imaginat...@nedharvey.com
 wrote:
  But, by talking about it, we're just smoking pipe dreams.  Cuz we all
 know zfs is developmentally challenged now.  But one can dream...

 I disagree the ZFS is developmentally challenged. There is more development
 now than ever in every way: # of developers, companies, OSes, KLOCs,
 features.
 Perhaps the level of maturity makes progress appear to be moving slower
 than
 it seems in early life?

  -- richard


Well, perhaps a part of it is marketing.   Maturity isn't really an excuse
for not having a long-term feature roadmap.  It seems as though maturity
in this case equals stagnation.  What are the features being worked on we
aren't aware of?  The big ones that come to mind that everyone else is
talking about for not just ZFS but openindiana as a whole and other storage
platforms would be:
1. SMB3 - hyper-v WILL be gaining market share over the next couple years,
not supporting it means giving up a sizeable portion of the market.  Not to
mention finally being able to run SQL (again) and Exchange on a fileshare.
2. VAAI support.
3. the long-sought bp-rewrite.
4. full drive encryption support.
5. tiering (although I'd argue caching is superior, it's still a checkbox).

There's obviously more, but those are just ones off the top of my head that
others are supporting/working on.  Again, it just feels like all the work
is going into fixing bugs and refining what is there, not adding new
features.  Obviously Saso personally added features, but overall there
don't seem to be a ton of announcements to the list about features that
have been added or are being actively worked on.  It feels like all these
companies are just adding niche functionality they need that may or may not
be getting pushed back to mainline.

/debbie-downer


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Richard Elling
On Jan 20, 2013, at 4:51 PM, Tim Cook t...@cook.ms wrote:

 On Sun, Jan 20, 2013 at 6:19 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 On Jan 20, 2013, at 8:16 AM, Edward Harvey imaginat...@nedharvey.com wrote:
  But, by talking about it, we're just smoking pipe dreams.  Cuz we all know 
  zfs is developmentally challenged now.  But one can dream...
 
 I disagree the ZFS is developmentally challenged. There is more development
 now than ever in every way: # of developers, companies, OSes, KLOCs, features.
 Perhaps the level of maturity makes progress appear to be moving slower than
 it seems in early life?
 
  -- richard
 
 Well, perhaps a part of it is marketing.  

A lot of it is marketing :-/

 Maturity isn't really an excuse for not having a long-term feature roadmap.  
 It seems as though maturity in this case equals stagnation.  What are the 
 features being worked on we aren't aware of?

Most of the illumos-centric discussion is on the developer's list. The
ZFSonLinux and BSD communities are also quite active. Almost none of the ZFS
developers hang out on this zfs-discuss@opensolaris.org list anymore. In fact,
I wonder why I'm still here...

  The big ones that come to mind that everyone else is talking about for not 
 just ZFS but openindiana as a whole and other storage platforms would be:
 1. SMB3 - hyper-v WILL be gaining market share over the next couple years, 
 not supporting it means giving up a sizeable portion of the market.  Not to 
 mention finally being able to run SQL (again) and Exchange on a fileshare.

I know of at least one illumos community company working on this. However, I
do not know their public plans.

 2. VAAI support.  

VAAI has 4 features, 3 of which have been in illumos for a long time. The
remaining feature (SCSI UNMAP) was done by Nexenta and exists in their
NexentaStor product, but the CEO made a conscious (and unpopular) decision to
keep that code from the community. Over the summer, another developer picked
up the work in the community, but I've lost track of the progress and haven't
seen an RTI yet.

 3. the long-sought bp-rewrite.

Go for it!

 4. full drive encryption support.

This is mostly a key management issue. Unfortunately, the open source code for
handling this (trousers) covers much more than keyed disks and can be
unwieldy. I'm not sure which distros picked up trousers, but it doesn't belong
in illumos-gate and it doesn't expose itself to ZFS.

 5. tiering (although I'd argue caching is superior, it's still a checkbox).

You want to add tiering to the OS? That has been available for a long time via
the (defunct?) SAM-QFS project, which actually delivered code:
http://hub.opensolaris.org/bin/view/Project+samqfs/

If you want to add it to ZFS, that is a different conversation.
 -- richard

 
 There's obviously more, but those are just ones off the top of my head that 
 others are supporting/working on.  Again, it just feels like all the work is 
 going into fixing bugs and refining what is there, not adding new features.  
 Obviously Saso personally added features, but overall there don't seem to be 
 a ton of announcements to the list about features that have been added or are 
 being actively worked on.  It feels like all these companies are just adding 
 niche functionality they need that may or may not be getting pushed back to 
 mainline.
 
 /debbie-downer
 

--

richard.ell...@richardelling.com
+1-760-896-4422


[zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-19 Thread Jim Klimov

Hello all,

  While revising my home NAS, which had dedup enabled before I gathered
that its RAM capacity was too puny for the task, I found that there is
some deduplication among the data I uploaded there (makes sense, since
it holds backups of many of the computers I've worked on - some of my
homedirs' contents were bound to intersect). However, a lot of the
blocks are in fact unique: they have entries in the DDT with count=1
and the dedup bit set in their blkptr_t. In effect they are not deduped,
and with my pouring of backups complete, they are unlikely to ever
become deduped.

  Thus these many unique deduped blocks are just a burden: when my
system writes into the datasets with dedup enabled, when it walks the
superfluously large DDT, when it has to store this DDT on disk and in
the ARC, and maybe during scrubbing... These entries bring lots of
headache (and performance degradation) for zero gain.

  So I thought it would be a nice feature to let ZFS go over the DDT
(I wouldn't care if it required offlining/exporting the pool), evict the
entries with count==1, locate the corresponding block-pointer tree
entries on disk, and clear their dedup bits, making such blocks into
regular unique ones. This would require rewriting metadata (a smaller
DDT, new block pointers) but should not touch or reallocate the
already-saved user data (the blocks' contents) on disk. The new BP
without the dedup bit set would have the same contents in its other
fields (though its parents would of course have to change more - new
DVAs, new checksums...).

  In the end my pool would only track as deduped those blocks which do
already have two or more references - which, given the static nature
of such a backup box, should be enough (i.e. new full backups of the same
source data would remain deduped and use no extra space, while unique
data won't waste the resources being accounted as deduped).

What do you think?
//Jim
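For concreteness, here is a minimal sketch of the pass being proposed,
in Python rather than ZFS's C. The DdtEntry and BlockPointer structures
and the prune_unique_entries() function are hypothetical stand-ins for
the real on-disk DDT and blkptr_t; the point is only to show the shape
of the operation: evict count==1 DDT entries and clear the dedup bit on
the matching block pointers, leaving user data untouched.

```python
# Illustrative sketch only - these are NOT real ZFS structures or APIs.
from dataclasses import dataclass

@dataclass
class BlockPointer:
    checksum: int        # stand-in for the blkptr_t checksum
    dedup: bool = True   # stand-in for the blkptr_t dedup bit

@dataclass
class DdtEntry:
    checksum: int
    refcount: int        # the DDT "count" for this block

def prune_unique_entries(ddt, bps):
    """Evict count==1 DDT entries and clear the dedup bit on their BPs.

    Only metadata (DDT + block pointers) is rewritten; the blocks'
    contents stay where they are, as the RFE proposes.
    """
    unique = {e.checksum for e in ddt if e.refcount == 1}
    pruned_ddt = [e for e in ddt if e.refcount > 1]
    for bp in bps:
        if bp.checksum in unique:
            bp.dedup = False   # block becomes a regular unique block
    return pruned_ddt

ddt = [DdtEntry(0xA, 1), DdtEntry(0xB, 3), DdtEntry(0xC, 1)]
bps = [BlockPointer(0xA), BlockPointer(0xB), BlockPointer(0xC)]
ddt = prune_unique_entries(ddt, bps)
```

After the pass, only the count==3 entry remains in the DDT, and the two
unique blocks are tracked as ordinary non-dedup blocks.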


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-19 Thread Nico Williams
I've wanted a system where dedup applies only to blocks being written
that have a good chance of being dups of others.

I think one way to do this would be to keep a scalable Bloom filter
(on disk) into which one inserts block hashes.

To decide if a block needs dedup one would first check the Bloom
filter, then if the block is in it, use the dedup code path, else the
non-dedup codepath and insert the block in the Bloom filter.  This
means that the filesystem would store *two* copies of any
deduplicatious block, with one of those not being in the DDT.

This would allow most writes of non-duplicate blocks to be faster than
normal dedup writes, but still slower than normal non-dedup writes:
the Bloom filter will add some cost.

The nice thing about this is that Bloom filters can be sized to fit in
main memory, and will be much smaller than the DDT.

It's very likely that this is a bit too obvious to just work.

Of course, it is easier to just use flash.  It's also easier to just
not dedup: the most highly deduplicatious data (VM images) is
relatively easy to manage using clones and snapshots, to a point
anyways.

Nico
--
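The gating idea above can be sketched in a few lines. This is an
illustrative toy, not ZFS code: BloomFilter and choose_write_path() are
hypothetical names, and a real implementation would use a scalable,
on-disk filter rather than a fixed in-memory bit array. The first
sighting of a block hash takes the plain (non-dedup) path and is
inserted into the filter; a later sighting of the same hash takes the
dedup path.

```python
# Toy sketch of Bloom-filter-gated dedup (illustrative names only).
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from salted BLAKE2b digests.
        for i in range(self.hashes):
            h = hashlib.blake2b(key, salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(h[:8], "little") % self.size

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def choose_write_path(bf: BloomFilter, block: bytes) -> str:
    digest = hashlib.sha256(block).digest()
    if bf.might_contain(digest):
        return "dedup"    # probably seen before: consult the DDT
    bf.add(digest)
    return "plain"        # first sighting: fast non-dedup path

bf = BloomFilter()
```

Note the consequence Nico describes: the first copy of a duplicate block
goes down the plain path, so the filesystem ends up storing two copies
of any deduplicatious block, one of them outside the DDT. The filter's
size is tunable independently of the DDT, which is what lets it fit in
main memory.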


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-19 Thread Richard Elling
Bloom filters are a great fit for this :-)

  -- richard



On Jan 19, 2013, at 5:59 PM, Nico Williams n...@cryptonector.com wrote:

 I've wanted a system where dedup applies only to blocks being written
 that have a good chance of being dups of others.
 
 I think one way to do this would be to keep a scalable Bloom filter
 (on disk) into which one inserts block hashes.
 
 To decide if a block needs dedup one would first check the Bloom
 filter, then if the block is in it, use the dedup code path, else the
 non-dedup codepath and insert the block in the Bloom filter.  This
 means that the filesystem would store *two* copies of any
 deduplicatious block, with one of those not being in the DDT.
 
 This would allow most writes of non-duplicate blocks to be faster than
 normal dedup writes, but still slower than normal non-dedup writes:
 the Bloom filter will add some cost.
 
 The nice thing about this is that Bloom filters can be sized to fit in
 main memory, and will be much smaller than the DDT.
 
 It's very likely that this is a bit too obvious to just work.
 
 Of course, it is easier to just use flash.  It's also easier to just
 not dedup: the most highly deduplicatious data (VM images) is
 relatively easy to manage using clones and snapshots, to a point
 anyways.
 
 Nico
 --