Re: [zfs-discuss] Encryption?

2010-07-10 Thread Edho P Arief
On Sun, Jul 11, 2010 at 11:51 AM, Michael Johnson
 wrote:
> I'm planning on running FreeBSD in VirtualBox (with a Linux host) and giving
> it raw disk access to four drives, which I plan to configure as a raidz2
> volume.
> On top of that, I'm considering using encryption.  I understand that ZFS
> doesn't yet natively support encryption, so my idea was to set each drive up
> with full-disk encryption in the Linux host (e.g., using TrueCrypt or
> dmcrypt), mount the encrypted drives, and then give the virtual machine
> access to the virtual unencrypted drives.  So the encryption would be
> transparent to FreeBSD.
> However, I don't know enough about ZFS to know if this is a good idea.  I
> know that I need to specifically configure VirtualBox to respect cache
> flushes, so that data really is on disk when ZFS expects it to be.  Would
> putting ZFS on top of full-disk encryption like this cause any problems?
>  E.g., if the (encrypted) physical disk has a problem and as a result a
> larger chunk of the unencrypted data is corrupted, would ZFS handle that
> well?  Are there any other possible consequences of this idea that I should
> know about?  (I'm not too worried about any hits in performance; I won't be
> reading or writing heavily, nor in time-sensitive applications.)
> I should add that since this is a desktop I'm not nearly as worried about
> encryption as if it were a laptop (theft or loss are less likely), but
> encryption would still be nice.  However, data integrity is the most
> important thing (I'm storing backups of my personal files on this), so if
> there's a chance that ZFS wouldn't handle errors well when on top of
> encryption, I'll just go without it.
> Thanks,
> Michael
>

You can also create ZFS on top of GELI[1][2] devices. Create the
encrypted disks first and then use those to create the zpool.

Exact steps (assuming single disk, da1):

- create the key
# dd if=/dev/random of=/root/da1.key bs=64 count=1

- initialize the GELI disk. If you want to use only the keyfile for
authentication, or to attach the disk automatically at boot, check the
reference links for the relevant initialization options (-K and -b)
# geli init -s 4096 -K da1.key /dev/da1

- attach GELI disk
# geli attach -k da1.key /dev/da1

- create zpool, either directly on geli disk or by creating it on top of GPT
>>direct:
# zpool create securepool da1.eli

>>on top of GPT:
# gpart create -s gpt da1.eli
# gpart add -t freebsd-zfs da1.eli
# zpool create securepool da1.elip1

- adjust rc.conf and loader.conf accordingly
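
For reference, the sort of entries meant here (a sketch for the keyfile-only
setup above; exact flags depend on the options you passed to geli init):

/boot/loader.conf:
  geom_eli_load="YES"
  zfs_load="YES"

/etc/rc.conf:
  zfs_enable="YES"
  geli_devices="da1"
  geli_da1_flags="-k /root/da1.key"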

Another tutorial: http://forums.freebsd.org/showthread.php?t=2775

[1] 
http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/disks-encrypting.html

[2] 
http://www.freebsd.org/cgi/man.cgi?query=geli&apropos=0&sektion=0&manpath=FreeBSD+8.0-RELEASE&format=html

-- 
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org


Re: [zfs-discuss] SATA 6G controller for OSOL

2010-07-10 Thread Marc Bevand
Graham McArdle  ccfe.ac.uk> writes:
> 
> This thread from Marc Bevand and his blog linked therein might have some
> useful alternative suggestions.
> http://opensolaris.org/jive/thread.jspa?messageID=480925
> I've bookmarked it because it's quite a handy summary and I hope he keeps
> updating it with new info.

Yes I will!

-mrbsun



[zfs-discuss] Encryption?

2010-07-10 Thread Michael Johnson
I'm planning on running FreeBSD in VirtualBox (with a Linux host) and giving it 
raw disk access to four drives, which I plan to configure as a raidz2 volume.

On top of that, I'm considering using encryption.  I understand that ZFS 
doesn't 
yet natively support encryption, so my idea was to set each drive up with 
full-disk encryption in the Linux host (e.g., using TrueCrypt or dmcrypt), 
mount 
the encrypted drives, and then give the virtual machine access to the virtual 
unencrypted drives.  So the encryption would be transparent to FreeBSD.
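
For concreteness, with dm-crypt the host-side setup would look roughly like
this (just a sketch; device and file names are made up, and the first two
steps repeat for each of the four drives):

# cryptsetup luksFormat /dev/sdb
# cryptsetup luksOpen /dev/sdb cryptdisk0
# VBoxManage internalcommands createrawvmdk \
    -filename ~/vbox/cryptdisk0.vmdk -rawdisk /dev/mapper/cryptdisk0

The resulting .vmdk files would then be attached to the FreeBSD VM as its 
"raw" disks.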

However, I don't know enough about ZFS to know if this is a good idea.  I know 
that I need to specifically configure VirtualBox to respect cache flushes, so 
that data really is on disk when ZFS expects it to be.  Would putting ZFS on 
top 
of full-disk encryption like this cause any problems?  E.g., if the (encrypted) 
physical disk has a problem and as a result a larger chunk of the unencrypted 
data is corrupted, would ZFS handle that well?  Are there any other possible 
consequences of this idea that I should know about?  (I'm not too worried about 
any hits in performance; I won't be reading or writing heavily, nor in 
time-sensitive applications.)
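
(For the record, the VirtualBox knob I mean is the IgnoreFlush extradata
setting; something like this, assuming the raw disk is attached to the
SATA/AHCI controller at port 0 -- VM and port numbers are illustrative:)

# VBoxManage setextradata "freebsd-storage" \
    "VBoxInternal/Devices/ahci/0/LUN#0/Config/IgnoreFlush" 0

Setting the value to 0 makes VirtualBox pass the guest's flush requests
through to the host instead of ignoring them.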

I should add that since this is a desktop I'm not nearly as worried about 
encryption as if it were a laptop (theft or loss are less likely), but 
encryption would still be nice.  However, data integrity is the most important 
thing (I'm storing backups of my personal files on this), so if there's a 
chance 
that ZFS wouldn't handle errors well when on top of encryption, I'll just go 
without it.

Thanks,
Michael




Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Erik Trimble

On 7/10/2010 10:14 AM, Brandon High wrote:
> On Sat, Jul 10, 2010 at 5:33 AM, Erik Trimble wrote:
>> Which brings up an interesting idea:   if I have a pool with good
>> random I/O (perhaps made from SSDs, or even one of those nifty
>> Oracle F5100 things), I would probably not want to have a DDT
>> created, or at least have one that was very significantly
>> abbreviated.   What capability does ZFS have for recognizing that
>> we won't need a full DDT created for high-I/O-speed pools?
>> Particularly with the fact that such pools would almost certainly
>> be heavy candidates for dedup (the $/GB being significantly higher
>> than other mediums, and thus space being at a premium)?
>
> I'm not exactly sure what problem you're trying to solve. Dedup is to
> save space, not accelerate i/o. While the DDT is pool-wide, only data
> that's added to datasets with dedup enabled will create entries in the
> DDT. If there's data that you don't want to dedup, then don't add it
> to a pool with dedup enabled.




What I'm talking about here is that caching the DDT in the ARC takes a 
non-trivial amount of space (as we've discovered). For a pool consisting 
of backing store with access times very close to that of main memory, 
there's no real benefit from caching it in the ARC/L2ARC, so it would be 
useful if the DDT was simply kept somewhere on the actual backing store, 
and there was some way to tell ZFS to look there exclusively, and not 
try to build/store a DDT in ARC.
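
The closest existing knobs I'm aware of are the per-dataset cache
properties, e.g. (dataset name is just an example):

# zfs set primarycache=metadata tank/fast
# zfs set secondarycache=none tank/fast

but those steer what a dataset caches in ARC/L2ARC; they don't single out
the DDT, which is pool-wide metadata.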





>> I'm not up on exactly how the DDT gets built and referenced to
>> understand how this might happen.  But, I can certainly see it as
>> being useful to tell ZFS (perhaps through a pool property?) that
>> building an in-ARC DDT isn't really needed.
>
> The DDT is in the pool, not in the ARC. Because it's frequently
> accessed, some / most of it will reside in the ARC.
>
> -B
>
> --
> Brandon High : bh...@freaks.com


Are you sure? I was under the impression that the DDT had to be built 
from info in the pool, but that what we call the DDT only exists in the 
ARC.  That's my understanding from reading the ddt.h and ddt.c files - 
that the 'ddt_enty' and 'ddt' structures exist in RAM/ARC/L2ARC, but not 
on disk. Those two are built using the 'ddt_key' and 'ddt_bookmark' 
structures on disk.


Am I missing something?

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Garrett D'Amore
Even the most expensive decompression algorithms generally run
significantly faster than I/O to disk -- at least when real disks are
involved.  So, as long as you don't run out of CPU and have to wait for
CPU to be available for decompression, the decompression will win.  The
same concept is true for dedup, although I don't necessarily think of
dedup as a form of compression (others might reasonably do so though.)

- Garrett

On Sat, 2010-07-10 at 19:09 -0400, Edward Ned Harvey wrote:
> > From: Roy Sigurd Karlsbakk [mailto:r...@karlsbakk.net]
> > > increases the probability of arc/ram cache hit. So dedup allows you
> > to
> > > stretch your disk, and also stretch your ram cache. Which also
> > > benefits performance.
> > 
> > Theoretically, yes, but there will be an overhead in cpu/memory that
> > can reduce this benefit to a penalty.
> 
> That's why a really fast compression algorithm is used in-line, in hopes that 
> the time cost of compression is smaller than the performance gain of 
> compression.  Take for example, v.42bis and v.44 which was used to accelerate 
> 56K modems.  (Probably still are, if you actually have a modem somewhere.  ;-)
> 
> Nowadays we have faster communication channels; in fact when talking about 
> dedup we're talking about local disk speed, which is really fast.  But we 
> also have fast processors, and the algorithm in question can be really fast.
> 
> I recently benchmarked lzop, gzip, bzip2, and lzma for some important data on 
> our fileserver that I would call "typical."  No matter what I did, lzop was 
> so ridiculously light weight that I could never get lzop up to 100% cpu.  
> Even reading data 100% from cache and filtering through lzop to /dev/null, 
> the kernel overhead of reading ram cache was higher than the cpu overhead to 
> compress.
> 
> For the data in question, lzop compressed to 70%, gzip compressed to 42%, 
> bzip 32%, and lzma something like 16%.  bzip2 was the slowest (by a factor of 
> 4).  lzma -1 and gzip --fast were closely matched in speed but not 
> compression.  So the compression of lzop was really weak for the data in 
> question, but it contributed no significant cpu overhead.  The point is:  
> It's absolutely possible to compress quickly, if you have a fast algorithm, 
> and gain performance.  I'm boldly assuming dedup performs this fast.  It 
> would be nice to actually measure and prove it.
> 




Re: [zfs-discuss] zpool scrub is clean but still run in checksum errors when sending

2010-07-10 Thread devsk
Funny thing is that if I enable the snapdir and 'cat' the file, it doesn't say 
IO error.


Re: [zfs-discuss] zpool scrub is clean but still run in checksum errors when sending

2010-07-10 Thread devsk
This is the fourth time it's happened.

I had a clean scrub before starting send 15 minutes ago. And now, I see 
permanent errors in one of the files of one of the snapshots. Different file 
from last time.

How do I find what's going on? 'dmesg' in guest or kernel does not have any 
relevant entries.

BTW: This is inside VirtualBox but I have the host cache disabled.


[zfs-discuss] zpool scrub is clean but still run in checksum errors when sending

2010-07-10 Thread devsk
'zpool scrub mypool' comes out clean.
zfs send -Rv myp...@blah | ssh ...

reports IO error. And indeed, 'zpool status -v' shows errors in some files in 
an older snapshot.

Repeat scrub without errors and clear the pool. And send now fails on a 
different set of files.

How can this happen?

What would be the best way to troubleshoot this kind of error?
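
So far the only checks I know of are along these lines (pool name is
illustrative):

# zpool status -v mypool     (lists files with permanent errors)
# fmdump -eV | tail -50      (FMA error telemetry)
# iostat -En                 (per-device error counters)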


Re: [zfs-discuss] Cache flush (or the lack of such) and corruption

2010-07-10 Thread Toby Thain


On 10-Jul-10, at 4:57 PM, Roy Sigurd Karlsbakk wrote:

> - Original Message -
>> Depends on the failure mode. I've spent hundreds (thousands?) of hours
>> attempting to recover data from backup tape because of bad hardware,
>> firmware, and file systems. The major difference is that ZFS cares that
>> the data is not correct, while older file systems did not care about
>> the data.
>
> It still seems like ZFS has a problem with its metadata. Reports of
> loss of pools because of metadata errors is what is worrying me. Can
> you give me any input on how to avoid this?

Roy

It needs to be pointed out that this only causes an integrity problem
for zfs *when the hardware stack is faulty* (not respecting flush).
And it obviously then affects all systems which assume working
barriers (RDBMS, reiser3fs, ext3fs, other journaling systems, etc).

--Toby

> Vennlige hilsener / Best regards
>
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 97542685
> r...@karlsbakk.net
> http://blogg.karlsbakk.net/
> --
> In all pedagogy it is essential that the curriculum be presented
> intelligibly. It is an elementary imperative for all pedagogues to avoid
> excessive use of idioms of foreign origin. In most cases adequate and
> relevant synonyms exist in Norwegian.



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Edward Ned Harvey
> From: Roy Sigurd Karlsbakk [mailto:r...@karlsbakk.net]
> > increases the probability of arc/ram cache hit. So dedup allows you
> to
> > stretch your disk, and also stretch your ram cache. Which also
> > benefits performance.
> 
> Theoretically, yes, but there will be an overhead in cpu/memory that
> can reduce this benefit to a penalty.

That's why a really fast compression algorithm is used in-line, in hopes that 
the time cost of compression is smaller than the performance gain of 
compression.  Take, for example, v.42bis and v.44, which were used to accelerate 
56K modems.  (Probably still are, if you actually have a modem somewhere.  ;-)

Nowadays we have faster communication channels; in fact when talking about 
dedup we're talking about local disk speed, which is really fast.  But we also 
have fast processors, and the algorithm in question can be really fast.

I recently benchmarked lzop, gzip, bzip2, and lzma for some important data on 
our fileserver that I would call "typical."  No matter what I did, lzop was so 
ridiculously light weight that I could never get lzop up to 100% cpu.  Even 
reading data 100% from cache and filtering through lzop to /dev/null, the 
kernel overhead of reading ram cache was higher than the cpu overhead to 
compress.

For the data in question, lzop compressed to 70%, gzip compressed to 42%, bzip 
32%, and lzma something like 16%.  bzip2 was the slowest (by a factor of 4).  
lzma -1 and gzip --fast were closely matched in speed but not compression.  So 
the compression of lzop was really weak for the data in question, but it 
contributed no significant cpu overhead.  The point is:  It's absolutely 
possible to compress quickly, if you have a fast algorithm, and gain 
performance.  I'm boldly assuming dedup performs this fast.  It would be nice 
to actually measure and prove it.
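
For anyone who wants to try something similar, a sketch (file name
illustrative); time gives the cpu cost and wc -c the compressed size:

# time lzop -c testfile | wc -c
# time gzip --fast -c testfile | wc -c
# time bzip2 -c testfile | wc -c
# time lzma -1 -c testfile | wc -c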



Re: [zfs-discuss] Cache flush (or the lack of such) and corruption

2010-07-10 Thread Bob Friesenhahn

On Sat, 10 Jul 2010, Roy Sigurd Karlsbakk wrote:

> I've been reading a lot of messages on this list about potential and
> actual corruption of a zpool due to cache flush problems and
> whatnot, and I find myself amazed.


You should not be amazed.  People only take their radio in for repair 
when it breaks.  Therefore, the radio repair man only experiences 
broken radios.  Meanwhile, millions of radios continue to work without 
fail.  It is the same on discussion lists and forums.


> I just wonder how a zpool compares with a good old filesystem when
> it comes to filesystem errors. It seems several of the members of
> this list have encountered problems where they had to boot a live CD
> to get their pool back, whereas a normal filesystem won't give this
> problem. The old-time filesystem might have corrupted data, but it
> still gets up.


Most "old-time filesystems" are tremendously smaller than today's zfs 
storage pools, and they might even be on just one disk.  Regardless, 
only someone with severely failing memory might think that "old-time 
filesystems" are somehow less failure prone than a zfs storage pool. 
The "good old days" does not apply to filesystems.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Cache flush (or the lack of such) and corruption

2010-07-10 Thread Roy Sigurd Karlsbakk
- Original Message -
> Depends on the failure mode. I've spent hundreds (thousands?) of hours
> attempting to recover data from backup tape because of bad hardware,
> firmware,
> and file systems. The major difference is that ZFS cares that the data
> is not
> correct, while older file systems did not care about the data.

It still seems like ZFS has a problem with its metadata. Reports of loss of 
pools because of metadata errors is what is worrying me. Can you give me any 
input on how to avoid this?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases adequate and relevant synonyms exist
in Norwegian.


Re: [zfs-discuss] Scrub extremely slow?

2010-07-10 Thread Hernan F
Too bad then, I can't afford a couple of SSDs for this machine as it's just a 
home file server. I'm surprised about the scrub speed though... This used to be 
a 4x500GB machine, in which I replaced the disks one by one. Resilvering (about 
80% full) took about 6 hours to complete - now it's twice the size and the 
scrub is taking 10x as long.

Guess I'll just have to wait for a fix.

Thanks!


Re: [zfs-discuss] Cache flush (or the lack of such) and corruption

2010-07-10 Thread Richard Elling
On Jul 10, 2010, at 12:53 PM, Roy Sigurd Karlsbakk wrote:
> Hi all
> 
> I've been reading a lot of messages on this list about potential and actual 
> corruption of a zpool due to cache flush problems and whatnot, and I find 
> myself amazed.

Why are you amazed?  Storage devices have been losing data since they
were invented :-P

> I just wonder how a zpool compares with a good old filesystem when it comes 
> to filesystem errors. It seems several of the members of this list have 
> encountered problems where they had to boot a live CD to get their pool back, 
> whereas a normal filesystem won't give this problem. The old-time filesystem 
> might have corrupted data, but it still gets up.

Depends on the failure mode.  I've spent hundreds (thousands?) of hours 
attempting to recover data from backup tape because of bad hardware, firmware,
and file systems. The major difference is that ZFS cares that the data is not
correct, while older file systems did not care about the data. 

[no, fsck does not correct data errors]

> Can someone give me some good input on this, and perhaps how to avoid an 
> entire pool becoming unavailable?

Use good quality hardware and redundancy.

> PS: I'm using small RAIDz2 pools with sufficient amount of redundancy

Good.
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/






[zfs-discuss] Cache flush (or the lack of such) and corruption

2010-07-10 Thread Roy Sigurd Karlsbakk
Hi all

I've been reading a lot of messages on this list about potential and actual 
corruption of a zpool due to cache flush problems and whatnot, and I find 
myself amazed.

I just wonder how a zpool compares with a good old filesystem when it comes to 
filesystem errors. It seems several of the members of this list have 
encountered problems where they had to boot a live CD to get their pool back, 
whereas a normal filesystem won't give this problem. The old-time filesystem 
might have corrupted data, but it still gets up.

Can someone give me some good input on this, and perhaps how to avoid an entire 
pool becoming unavailable?

PS: I'm using small RAIDz2 pools with sufficient amount of redundancy

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases adequate and relevant synonyms exist
in Norwegian.


Re: [zfs-discuss] Scrub extremely slow?

2010-07-10 Thread Roy Sigurd Karlsbakk
- Original Message -
> I tested with Bonnie++ and it reports about 200MB/s.
> 
> The pool version is 22 (SunOS solaris 5.11 snv_134 i86pc i386 i86pc
> Solaris)
> 
> I let the scrub run for hours and it was still at around 10MB/s. I
> tried to access an iSCSI target on that pool and it was really really
> slow (about 600KB/s!) while the scrub is running.

iSCSI access (and ZFS) during scrub is quite bad. If you add an SLOG (on SSD), 
this will help out. Also, adding an L2ARC SSD (or two) will help further. The 
scrub performance issues have been worked on in later ON versions, but then, 
Oracle doesn't seem to want to release those as distros, which is a bitch.
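
For reference, adding those is one command each (device names are
illustrative):

# zpool add mypool log c4t0d0
# zpool add mypool cache c4t1d0 c4t2d0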

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases adequate and relevant synonyms exist
in Norwegian.


Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Roy Sigurd Karlsbakk
> > 4% seems to be a pretty good SWAG.
> 
> Is the above "4%" wrong, or am I wrong?
> 
> Suppose 200bytes to 400bytes, per 128Kbyte block ...
> 200/131072 = 0.0015 = 0.15%
> 400/131072 = 0.003 = 0.3%
> which would mean for 100G unique data = 153M to 312M ram.
> 
> Around 3G ram for 1Tb unique data, assuming default 128K block

Recordsize sets the maximum block size. Smaller files will be stored in smaller 
blocks. With lots of files of different sizes, the block size will generally be 
smaller than the recordsize set for ZFS.

> Next question:
> 
> Correct me if I'm wrong, if you have a lot of duplicated data, then
> dedup
> increases the probability of arc/ram cache hit. So dedup allows you to
> stretch your disk, and also stretch your ram cache. Which also
> benefits performance.

Theoretically, yes, but there will be an overhead in cpu/memory that can reduce 
this benefit to a penalty.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases adequate and relevant synonyms exist
in Norwegian.


Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Brandon High
> 
> Dedup is to
> save space, not accelerate i/o. 

I'm going to have to disagree with you there.  Dedup is a type of
compression.  Compression can be used for storage savings, and/or
acceleration.  Fast and lightweight compression algorithms (lzop, v.42bis,
v.44) are usually used in-line for acceleration, while compute-expensive
algorithms (bzip2, lzma, gzip) are usually used for space savings and rarely
for acceleration (except when transmitting data across a slow channel).

Most general-purpose lossless compression algorithms (and certainly most of
the ones I just mentioned) achieve compression by reducing duplicated data.
There are special-purpose lossless formats (flac etc.) and lossy ones (jpg,
mp3 etc.) which use other techniques.  But general-purpose compression may
well consist exclusively of algorithms that reduce repeated data.

Unless I'm somehow mistaken, the performance benefit of dedup comes from the
fact that it increases cache hits.  Instead of having to read a thousand
duplicate blocks from different sectors of disks, you read it once, and the
other 999 have all been stored "same as" the original block, so it's 999
cache hits and unnecessary to read disk again.
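
(For completeness, turning it on and watching the effect is just -- dataset
and pool names illustrative:)

# zfs set dedup=on tank/data
# zpool get dedupratio tank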



Re: [zfs-discuss] Scrub extremely slow?

2010-07-10 Thread Hernan F
I tested with Bonnie++ and it reports about 200MB/s. 

The pool version is 22 (SunOS solaris 5.11 snv_134 i86pc i386 i86pc Solaris)

I let the scrub run for hours and it was still at around 10MB/s. I tried to 
access an iSCSI target on that pool and it was really really slow (about 
600KB/s!) while the scrub is running.


Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Edward Ned Harvey
> From: Richard Elling [mailto:rich...@nexenta.com]
> 
> 4% seems to be a pretty good SWAG.

Is the above "4%" wrong, or am I wrong?

Suppose 200bytes to 400bytes, per 128Kbyte block ... 
200/131072 = 0.0015 = 0.15%
400/131072 = 0.003 = 0.3%
which would mean for 100G unique data = 153M to 312M ram.

Around 3G ram for 1Tb unique data, assuming default 128K block
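
(If anyone wants to sanity-check the estimate against a real pool, zdb can
report the table directly -- pool name illustrative:)

# zdb -S tank     (simulates dedup on an existing pool, prints a DDT histogram)
# zdb -DD tank    (prints DDT statistics for a pool that already has dedup'd data)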

Next question:

Correct me if I'm wrong, if you have a lot of duplicated data, then dedup
increases the probability of arc/ram cache hit.  So dedup allows you to
stretch your disk, and also stretch your ram cache.  Which also benefits
performance.




Re: [zfs-discuss] Legality and the future of zfs...

2010-07-10 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Peter Taps
> 
> A few companies have already backed out of zfs
> as they cannot afford to go through a lawsuit. 

Or, in the case of Apple, who could definitely afford a lawsuit, but chose
to avoid it anyway.


> I am in a stealth
> startup company and we rely on zfs for our application. The future of
> our company, and many other businesses, depends on what happens to zfs.

For a lot of purposes, ZFS is the clear best solution.  But maybe you're not
necessarily in one of those situations?  Perhaps you could use Microsoft
VSS, or Linux BTRFS?

'Course, by all rights, those are copy-on-write too.  So why doesn't netapp
have a lawsuit against kernel.org, or microsoft?  Maybe cuz they just know
they'll damage their own business too much by suing Linus, and they can't
afford to go up against MS.  I guess.



Re: [zfs-discuss] Scrub extremely slow?

2010-07-10 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Hernan F
> Subject: [zfs-discuss] Scrub extremely slow?

Perhaps this is related?
http://hub.opensolaris.org/bin/view/Community+Group+zfs/11
Zpool version 11, introduced "improved scrub performance"


> Hello, I'm trying to figure out why I'm getting about 10MB/s scrubs, on
> a pool where I can easily get 100MB/s. It's 4x 1TB SATA2 (nv_sata),
> raidz. Athlon64 with 8GB RAM.

With sata disks, I expect approx 500Mbit/sec per disk for sustainable
sequential operations.  So I would agree you should easily be able to do
100MB/s = 800Mbit/sec.  You should actually be able to do something around
2x that.

I don't know why your scrub is slow.  Perhaps if you sit and watch it
longer, you'll see sometimes it's high and sometimes it's low? 
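
Something like this, left running in another terminal, will show whether the
rate moves around (pool name illustrative):

# zpool iostat -v tank 10
# zpool status tank        (shows scrub progress and estimated time remaining)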



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Brandon High
On Sat, Jul 10, 2010 at 5:33 AM, Erik Trimble wrote:

> Which brings up an interesting idea:   if I have a pool with good random
> I/O  (perhaps made from SSDs, or even one of those nifty Oracle F5100
> things),  I would probably not want to have a DDT created, or at least have
> one that was very significantly abbreviated.   What capability does ZFS have
> for recognizing that we won't need a full DDT created for high-I/O-speed
> pools?  Particularly with the fact that such pools would almost certainly be
> heavy candidates for dedup (the $/GB being significantly higher than other
> mediums, and thus space being at a premium) ?
>

I'm not exactly sure what problem you're trying to solve. Dedup is to save
space, not accelerate i/o. While the DDT is pool-wide, only data that's
added to datasets with dedup enabled will create entries in the DDT. If
there's data that you don't want to dedup, then don't add it to a pool with
dedup enabled.

> I'm not up on exactly how the DDT gets built and referenced to understand
> how this might happen.  But, I can certainly see it as being useful to tell
> ZFS (perhaps through a pool property?) that building an in-ARC DDT isn't
> really needed.
>

The DDT is in the pool, not in the ARC. Because it's frequently accessed,
some / most of it will reside in the ARC.

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Should i enable Write-Cache ?

2010-07-10 Thread Ross Walker
On Jul 10, 2010, at 5:46 AM, Erik Trimble  wrote:

> On 7/10/2010 1:14 AM, Graham McArdle wrote:
>>> Instead, create "Single Disk" arrays for each disk.
>>> 
>> I have a question related to this but with a different controller: If I'm 
>> using a RAID controller to provide non-RAID single-disk volumes, do I still 
>> lose out on the hardware-independence advantage of software RAID that I 
>> would get from a basic non-RAID HBA?
>> In other words, if the controller dies, would I still need an identical 
>> controller to recognise the formatting of 'single disk volumes', or is more 
>> 'standardised' than the typical proprietary implementations of hardware RAID 
>> that makes it impossible to switch controllers on  hardware RAID?
>>   
> 
> Yep. You're screwed.  :-)
> 
> single-disk volumes are still RAID volumes to the controller, so they'll have 
> the extra controller-specific bits on them. You'll need an identical 
> controller (or, possibly, just one from the same OEM) to replace a broken 
> controller with.
> 
> Even in JBOD mode, I wouldn't trust a RAID controller to not write 
> proprietary bits onto the disks.  It's one of the big reasons to chose a HBA 
> and not a RAID controller.

Not always: with my Dell PERC and the drives set up as single-disk RAID0
volumes, I was able to successfully import the pool on a regular LSI SAS
(non-RAID) controller.

The only change the PERC made was to coerce the disk size down by 128MB,
leaving 128MB unused at the end of the drive, which would mean new disks
would appear slightly bigger.

-Ross



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Richard Elling
On Jul 10, 2010, at 5:33 AM, Erik Trimble wrote:

> On 7/10/2010 5:24 AM, Richard Elling wrote:
>> On Jul 9, 2010, at 11:10 PM, Brandon High wrote:
>> 
>>   
>>> On Fri, Jul 9, 2010 at 5:18 PM, Brandon High  wrote:
>>> I think that DDT entries are a little bigger than what you're using. The 
>>> size seems to range between 150 and 250 bytes depending on how it's 
>>> calculated, call it 200b each. Your 128G dataset would require closer to 
>>> 200M (+/- 25%) for the DDT if your data was completely unique. 1TB of 
>>> unique data would require 600M - 1000M for the DDT.
>>> 
>>> Using 376b per entry, it's 376M for 128G of unique data, or just under 3GB 
>>> for 1TB of unique data.
>>> 
>> 4% seems to be a pretty good SWAG.
>> 
>>   
>>> A 1TB zvol with 8k blocks would require almost 24GB of memory to hold the 
>>> DDT. Ouch.
>>> 
>> ... or more than 300GB for 512-byte records.
>> 
>> The performance issue is that DDT access tends to be random. This implies 
>> that
>> if you don't have a lot of RAM and your pool has poor random read I/O 
>> performance,
>> then you will not be impressed with dedup performance. In other words, 
>> trying to
>> dedup lots of data on a small DRAM machine using big, slow pool HDDs will 
>> not set
>> any benchmark records. By contrast, using SSDs for the pool can demonstrate 
>> good
>> random read performance. As the price per bit of HDDs continues to drop, the 
>> value
>> of deduping pools using HDDs also drops.
>>  -- richard
>> 
>>   
> 
> Which brings up an interesting idea:   if I have a pool with good random I/O  
> (perhaps made from SSDs, or even one of those nifty Oracle F5100 things),  I 
> would probably not want to have a DDT created, or at least have one that was 
> very significantly abbreviated.   What capability does ZFS have for 
> recognizing that we won't need a full DDT created for high-I/O-speed pools?  
> Particularly with the fact that such pools would almost certainly be heavy 
> candidates for dedup (the $/GB being significantly higher than other mediums, 
> and thus space being at a premium) ?

Methinks it is impossible to build a complete DDT, we'll run out of atoms...
maybe if we can use strings?  :-)  Think of it as a very, very sparse array.
Otherwise it is managed just like other metadata.
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/






Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Erik Trimble

On 7/10/2010 5:24 AM, Richard Elling wrote:
> On Jul 9, 2010, at 11:10 PM, Brandon High wrote:
>
>> On Fri, Jul 9, 2010 at 5:18 PM, Brandon High wrote:
>> I think that DDT entries are a little bigger than what you're using. The size
>> seems to range between 150 and 250 bytes depending on how it's calculated,
>> call it 200b each. Your 128G dataset would require closer to 200M (+/- 25%)
>> for the DDT if your data was completely unique. 1TB of unique data would
>> require 600M - 1000M for the DDT.
>>
>> Using 376b per entry, it's 376M for 128G of unique data, or just under 3GB
>> for 1TB of unique data.
>
> 4% seems to be a pretty good SWAG.
>
>> A 1TB zvol with 8k blocks would require almost 24GB of memory to hold the
>> DDT. Ouch.
>
> ... or more than 300GB for 512-byte records.
>
> The performance issue is that DDT access tends to be random. This implies that
> if you don't have a lot of RAM and your pool has poor random read I/O
> performance, then you will not be impressed with dedup performance. In other
> words, trying to dedup lots of data on a small DRAM machine using big, slow
> pool HDDs will not set any benchmark records. By contrast, using SSDs for the
> pool can demonstrate good random read performance. As the price per bit of
> HDDs continues to drop, the value of deduping pools using HDDs also drops.
>  -- richard


Which brings up an interesting idea:   if I have a pool with good random 
I/O  (perhaps made from SSDs, or even one of those nifty Oracle F5100 
things),  I would probably not want to have a DDT created, or at least 
have one that was very significantly abbreviated.   What capability does 
ZFS have for recognizing that we won't need a full DDT created for 
high-I/O-speed pools?  Particularly with the fact that such pools would 
almost certainly be heavy candidates for dedup (the $/GB being 
significantly higher than other mediums, and thus space being at a 
premium) ?


I'm not up on exactly how the DDT gets built and referenced to 
understand how this might happen.  But, I can certainly see it as being 
useful to tell ZFS (perhaps through a pool property?) that building an 
in-ARC DDT isn't really needed.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Debunking the dedup memory myth

2010-07-10 Thread Richard Elling
On Jul 9, 2010, at 11:10 PM, Brandon High wrote:

> On Fri, Jul 9, 2010 at 5:18 PM, Brandon High  wrote:
> I think that DDT entries are a little bigger than what you're using. The size 
> seems to range between 150 and 250 bytes depending on how it's calculated, 
> call it 200b each. Your 128G dataset would require closer to 200M (+/- 25%) 
> for the DDT if your data was completely unique. 1TB of unique data would 
> require 600M - 1000M for the DDT.
> 
> Using 376b per entry, it's 376M for 128G of unique data, or just under 3GB 
> for 1TB of unique data.

4% seems to be a pretty good SWAG.

> A 1TB zvol with 8k blocks would require almost 24GB of memory to hold the 
> DDT. Ouch.

... or more than 300GB for 512-byte records.

The performance issue is that DDT access tends to be random. This implies that
if you don't have a lot of RAM and your pool has poor random read I/O 
performance,
then you will not be impressed with dedup performance. In other words, trying to
dedup lots of data on a small DRAM machine using big, slow pool HDDs will not 
set
any benchmark records. By contrast, using SSDs for the pool can demonstrate good
random read performance. As the price per bit of HDDs continues to drop, the 
value
of deduping pools using HDDs also drops.
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/






Re: [zfs-discuss] Should i enable Write-Cache ?

2010-07-10 Thread Erik Trimble

On 7/10/2010 1:14 AM, Graham McArdle wrote:

>> Instead, create "Single Disk" arrays for each disk.
>
> I have a question related to this but with a different controller: If I'm
> using a RAID controller to provide non-RAID single-disk volumes, do I still
> lose out on the hardware-independence advantage of software RAID that I
> would get from a basic non-RAID HBA?
> In other words, if the controller dies, would I still need an identical
> controller to recognise the formatting of 'single disk volumes', or is it
> more 'standardised' than the typical proprietary implementations of hardware
> RAID that make it impossible to switch controllers on hardware RAID?


Yep. You're screwed.  :-)

single-disk volumes are still RAID volumes to the controller, so they'll 
have the extra controller-specific bits on them. You'll need an 
identical controller (or, possibly, just one from the same OEM) to 
replace a broken controller with.


Even in JBOD mode, I wouldn't trust a RAID controller to not write 
proprietary bits onto the disks.  It's one of the big reasons to choose an 
HBA and not a RAID controller.




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Should i enable Write-Cache ?

2010-07-10 Thread Graham McArdle
> Instead, create "Single Disk" arrays for each disk.

I have a question related to this but with a different controller: If I'm using 
a RAID controller to provide non-RAID single-disk volumes, do I still lose out 
on the hardware-independence advantage of software RAID that I would get from a 
basic non-RAID HBA?
In other words, if the controller dies, would I still need an identical 
controller to recognise the formatting of 'single disk volumes', or is it more 
'standardised' than the typical proprietary implementations of hardware RAID 
that make it impossible to switch controllers on hardware RAID?


Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found

2010-07-10 Thread Lo Zio
In any case, do you have an idea of how to solve my current problem? I have 
450GB in a deduped dataset I want to destroy, and each attempt results in a 
machine hang. I just want to destroy a dataset and all of its snapshots!!
I tried unmounting before zfs destroy but had no luck...
Thanks


Re: [zfs-discuss] Legality and the future of zfs...

2010-07-10 Thread Miles Nordin
> "ab" == Alex Blewitt  writes:

>>> 3. The quality of software inside the firewire cases varies
>>> wildly and is a big source of stability problems.  (even on
>>> mac)

ab> It would be good if you could refrain from spreading FUD if
ab> you don't have experience with it. 

yup, my experience was with the Prolific PL-3705 chip, which was very
popular for a while.  it has two problems:

 * it doesn't auto-pick its ``ID number'' or ``address'' or something,
   so if you have two cases with this chip on the same bus, they won't
   work.  go google it!

 * it crashes.  as in, I reboot the computer but not the case, and the
   drive won't mount.  I reboot the case but not the computer, and
   the drive starts working again.

   http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html

   I even upgraded the firmware to give the chinese another shot.
   still broken.

You can easily google for other problems with firewire cases in
general.  The performance of the overall system is all over the place
depending on the bridge chip you use.  Some of them have problems with
``large'' transactions as well.  Some of them lose their shit when the
drive reports bad sectors, instead of passing the error along so you
can usefully diagnose it---not that they're the only devices with
awful exception handling in this area, but why add one more mystery?

I think it was already clear I had experience from the level of detail
in the other items I mentioned, though, wasn't it?

Add also to all of it the cache flush suspicions from Garrett: these
bridge chips have full-on ARM cores inside them and lots of buffers,
which is something SAS multipliers don't have AIUI.  Yeah, in a way
that's slightly FUDdy but not really since IIRC the write cache
problem has been verified at least on some USB cases, hasn't it?  Also
since the testing procedure for cache flush problems is a
little ad hoc, and a lot of people are therefore putting hardware
to work without testing cache flush at all, I think it makes perfect
sense to replace suspicious components with lengths of dumb wire where
possible even if the suspicions aren't proved.

ab> I have used FW400 and FW800 on Mac systems for the last 8
ab> years; the only problem was with the Oxford 911 chipset in OSX
ab> 10.1 days.

yeah, well, if you don't want to listen, then fine, don't listen.

ab> It may not suit everyone's needs, and it may not be supported
ab> well on OpenSolaris, but it works fine on a Mac.

aside from being slow unstable and expensive, yeah it works fine on
Mac.  But you don't really have the eSATA option on the mac unless you
pay double for the ``pro'' desktop, so i can see why you'd defend your
only choice of disk if you've already committed to apple.

Does the Mac OS even have an interesting zfs port?  Remind me why we
are discussing this, again?

