Re: [zfs-discuss] Restricting smb share to specific interfaces

2010-01-05 Thread Jerome Warnier
Tim Cook wrote:


 On Sun, Jan 3, 2010 at 6:58 PM, Jerome Warnier jwarn...@beeznest.net wrote:

 Hi,

 I'm sharing ZFS filesystems over SMB.
 I know how to restrict access to them for certain hosts (and users), but I
 could not find any way to prevent the SMB protocol from being advertised on a
 specific interface (or, the other way around, to specify the interfaces I
 do want it on).
 Is there any way, other than setting up a firewall, to filter the
 interface?



 I believe it can be done with Crossbow and Flows by defining cifs as a
 service.
 http://hub.opensolaris.org/bin/view/Project+crossbow/faq#flow_what
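
For reference, a rough sketch of what defining CIFS as a Crossbow flow might
look like. The link name e1000g0 is a placeholder and SMB over TCP is assumed
to be port 445; flows are primarily for accounting and bandwidth/priority
control, so whether this can actually stop the service from being advertised
on an interface is exactly the open question here.

# hypothetical link name; the NetBIOS session service (port 139) may need a flow too
flowadm add-flow -l e1000g0 -a transport=tcp,local_port=445 cifs-flow
flowadm show-flow cifs-flow
flowadm set-flowprop -p maxbw=10M cifs-flow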
  
Isn't there any simpler way?
With Samba you can restrict the service to specific interfaces. Isn't there
anything similar for the ZFS/CIFS service?
 -- 
 --Tim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Eric D. Mudama

On Mon, Jan  4 at 22:01, Thomas Burgess wrote:

  I guess i got some bad advice then
  I was told the kingston snv125-s2 used almost the exact same hardware as
  an x25-m and should be considered the poor mans x25-m

...


  Right, i couldn't find any of the 40 gb's in stock so i ordered the 64
  gb... same exact model, only bigger... does your previous statement about
  the larger model ssd's not apply to the kingstons?


The SNV125-S2/40GB is the "half an X25-M" drive, which can often be
found as a bare OEM drive for about $85 with a rebate.

Kingston does sell rebranded Intel SLC drives as well, but under a
different model number: SNE-125S2/32 or SNE-125S2/64.  I don't believe
the 64GB Kingston MLC (SNV-125S2/64) is based on Intel's controller.

The Kingston rebranding of the gen2 intel MLC design is
SNM-125S2B/80 or SNM-125S2B/160.  Those are essentially 34nm Intel
X25-M units I believe.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-05 Thread Eric D. Mudama

On Mon, Jan  4 at 16:43, Wes Felter wrote:

Eric D. Mudama wrote:


I am not convinced that a general purpose CPU, running other software
in parallel, will be able to be timely and responsive enough to
maximize bandwidth in an SSD controller without specialized hardware
support.


Fusion-io would seem to be a counter-example, since it uses a fairly 
simple controller (I guess the controller still performs ECC and 
maybe XOR) and the driver eats a whole x86 core. The result is very 
high performance.


Wes Felter


I see what you're saying, but it isn't obvious (to me) how well
they're using all the hardware at hand.  2GB/s of bandwidth over their
PCI-e link and what looks like a TON of NAND, with a nearly-dedicated
x86 core...  resulting in 600MB/s or something like that?

While the number is very good for NAND flash SSDs, it seems like a TON
of horsepower going to waste, and they still have a large onboard
controller/FPGA.  I guess enough CPU can make the units faster, but
I'm just not sold.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Thomas Burgess
The SNV125-S2/40GB is the half an X25-M drive which can be often

 found as a bare OEM drive for about $85 w/ rebate.

 Kingston does sell rebranded Intel SLC drives as well, but under a
 different model number: SNE-125S2/32 or SNE-125S2/64.  I don't believe
 the 64GB Kingston MLC (SNV-125S2/64) is based on Intel's controller.

 The Kingston rebranding of the gen2 intel MLC design is
 SNM-125S2B/80 or SNM-125S2B/160.  Those are essentially 34nm Intel
 X25-M units I believe.
  edmud...@mail.bounceswoosh.org

Yes, SSD model numbers are purposely confusing and deceitful, I think.

All in all, I have no one but myself to blame... and even with this mishap the
SSD isn't a waste of money.

The 64 GB version is based on the second revision of the dreaded JMicron
controller, but according to my new research the original issues with that
controller were fixed before this SSD was released... so apparently they
DO perform as expected.

Worst case scenario, I can use two to mirror my rpool and one for a cheap L2ARC.

I also noticed Intel sells a cheaper model of the X25;
I want to say it was the X25-V, but I might be wrong...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-05 Thread Andrey Kuzmin
600? I've heard 1.5 GB/s reported.

On 1/5/10, Eric D. Mudama edmud...@bounceswoosh.org wrote:
 On Mon, Jan  4 at 16:43, Wes Felter wrote:
Eric D. Mudama wrote:

I am not convinced that a general purpose CPU, running other software
in parallel, will be able to be timely and responsive enough to
maximize bandwidth in an SSD controller without specialized hardware
support.

Fusion-io would seem to be a counter-example, since it uses a fairly
simple controller (I guess the controller still performs ECC and
maybe XOR) and the driver eats a whole x86 core. The result is very
high performance.

Wes Felter

 I see what you're saying, but it isn't obvious (to me) how well
 they're using all the hardware at hand.  2GB/s of bandwidth over their
 PCI-e link and what looks like a TON of NAND, with a nearly-dedicated
 x86 core...  resuting in 600MB/s or something like that?

 While the number is very good for NAND flash SSDs, it seems like a TON
 of horsepower going to waste, and they still have a large onboard
 controller/FPGA.  I guess enough CPU can make the units faster, but
 i'm just not sold.

 --
 Eric D. Mudama
 edmud...@mail.bounceswoosh.org

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



-- 
Regards,
Andrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] send/recv, apparent data loss

2010-01-05 Thread Michael Herf
I replayed a bunch of filesystems in order to get dedupe benefits.
The only thing is, a couple of them are rolled back to November or so (and
I didn't notice before destroying the old copies).

I used something like:

zfs snapshot pool/fs@dd
zfs send -Rp pool/fs@dd | zfs recv -d pool/fs2
(after done...)
zfs destroy pool/fs
zfs rename pool/fs2/fs pool/fs

What are the failure modes for partial send/recv? I've experienced
full rollbacks when the process is canceled.
But in my case it feels like the stream was truncated and the filesystem
ended up only partially built. Is this an expected result?

It does seem like ZFS needs a way to do this kind of operation
atomically in the future, but I'm more interested in understanding if
there's something I did wrong using the current tools, or if there are
bugs.

I was running b130 to do these operations, and it seems like previous
attempts in b128 and b129 completed successfully.

mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Joerg Schilling
Chris Du dilid...@gmail.com wrote:

 You can use the utility to erase all blocks and regain performance, but it's 
 a manual process and quite complex. Windows 7 support TRIM, if SSD firmware 
 also supports it, the process is run in the background so you will not notice 
 performance degrade. I don't think any other OS supports TRIM.

IIRC, Linux also supports the TRIM command.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] send/recv, apparent data loss

2010-01-05 Thread Ian Collins

Michael Herf wrote:

I replayed a bunch of filesystems in order to get dedupe benefits.
Only thing is a couple of them are rolled back to November or so (and
I didn't notice before destroy'ing the old copy).

I used something like:

zfs snapshot pool/f...@dd
zfs send -Rp pool/f...@dd | zfs recv -d pool/fs2
(after done...)
zfs destroy pool/fs
zfs rename pool/fs2/fs pool/fs

What are the failure modes for partial send/recv? I've experienced
full rollbacks when the process is canceled.
But my case feels like the stream became truncated and the filesystem
ended up partially built? Is this an expected result?
  

Individual receives should be atomic.


It does seem like ZFS needs a way to do this kind of operation
atomically in the future, but I'm more interested in understanding if
there's something I did wrong using the current tools, or if there are
bugs.
  
Was there any error output?  I always use -v on recursive receives to 
track progress.
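
For example, a sketch of the same replication with a verbose receive (dataset
names follow the commands quoted above), so errors and per-snapshot progress
are visible as the stream is applied:

zfs snapshot pool/fs@dd
zfs send -Rp pool/fs@dd | zfs recv -d -v pool/fs2
echo "receive exit status: $?"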


--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] send/recv, apparent data loss

2010-01-05 Thread Michael Herf
I didn't use -v, so I don't know.
I just waited until the process exited, assuming it would succeed or
fail. The sizes looked equivalent, so I went ahead with the destroy,
rename.

For the jobs a couple weeks ago, I turned off the snapshot service.
For this one, I probably left it on. Anything possible there?

The only other thing is that I did zfs rollback for a totally
unrelated filesystem in the pool, but I have no idea if this could
have affected it.
(I've verified that I got the right one with zpool history.)

mike


On Tue, Jan 5, 2010 at 2:24 AM, Ian Collins i...@ianshome.com wrote:
 Michael Herf wrote:

 I replayed a bunch of filesystems in order to get dedupe benefits.
 Only thing is a couple of them are rolled back to November or so (and
 I didn't notice before destroy'ing the old copy).

 I used something like:

 zfs snapshot pool/f...@dd
 zfs send -Rp pool/f...@dd | zfs recv -d pool/fs2
 (after done...)
 zfs destroy pool/fs
 zfs rename pool/fs2/fs pool/fs

 What are the failure modes for partial send/recv? I've experienced
 full rollbacks when the process is canceled.
 But my case feels like the stream became truncated and the filesystem
 ended up partially built? Is this an expected result?


 Individual receives should be atomic.

 It does seem like ZFS needs a way to do this kind of operation
 atomically in the future, but I'm more interested in understanding if
 there's something I did wrong using the current tools, or if there are
 bugs.


 Was there any error output?  I always use -v on recursive receives to track
 progress.

 --
 Ian.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Mikko Lammi
Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. The good thing
about ZFS is that it allows this without any issues. Unfortunately, now that
we need to get rid of them (because they eat 80% of the disk space), it is
proving quite challenging.

Traditional approaches like "find ./ -exec rm {} \;" seem to take forever
- after running for several days, the directory size still stays the same. The
only way I've been able to remove anything has been by running "rm -rf"
on the problematic directory from the parent level. Running this command
shows the directory size decreasing by about 10,000 files/hour, but that would
still mean close to ten months (over 250 days) to delete everything!

I also tried running unlink on the directory as root, as the user who
created the directory, after changing the directory's owner to root, and so
forth, but all attempts gave a "Not owner" error.

Any command like "ls -f" or "find" will run for hours (or days) without
actually listing anything from the directory, so I'm beginning to suspect
that the directory's data structure is somehow damaged. Are there any
diagnostics that I can run with e.g. zdb to investigate, and hopefully fix,
a single directory within a ZFS dataset?

To make things even more difficult, this directory is located in the root
filesystem, so dropping the ZFS filesystem would basically mean reinstalling
the entire system, which is something we would really rather not do.

The OS is Solaris 10 and the zpool version is 10 (rather old, I know, but is
there an easy upgrade path that might solve this problem?). The pool consists
of two 146 GB SAS drives in a mirror setup.


Any help would be appreciated.

Thanks,
Mikko

-- 
 Mikko Lammi | l...@lmmz.net | http://www.lmmz.net

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Joerg Schilling
Mikko Lammi mikko.la...@lmmz.net wrote:

 Hello,

 As a result of one badly designed application running loose for some time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

 Traditional approaches like find ./ -exec rm {} \; seem to take forever
 - after running several days, the directory size still says the same. The
 only way how I've been able to remove something has been by giving rm
 -rf to problematic directory from parent level. Running this command
 shows directory size decreasing by 10,000 files/hour, but this would still
 mean close to ten months (over 250 days) to delete everything!

Do you know the number of files at which it really starts to become unusably
slow? I had directories with 3 million files on UFS and this was just a bit
slower than with small directories.

BTW: "find ./ -exec rm {} \;" is definitely the wrong command, as it has long
been known to take forever. This is why "find ./ -exec rm {} +" was
introduced 20 years ago.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Markus Kovero
Hi. While this is not a complete solution, I'd suggest turning atime off, so
that find/rm does not update access times, and possibly destroying unnecessary
snapshots before removing the files; that should be quicker.
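
As a concrete sketch (the dataset name is a placeholder; use whichever
filesystem actually holds the directory):

zfs set atime=off rpool/ROOT/s10_root          # placeholder dataset name
zfs get atime rpool/ROOT/s10_root
zfs list -t snapshot -r rpool/ROOT/s10_root    # snapshots still referencing the files
zfs destroy rpool/ROOT/s10_root@oldsnap        # only if the snapshot is disposable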


Yours
Markus Kovero


-Original Message-
From: zfs-discuss-boun...@opensolaris.org 
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Mikko Lammi
Sent: 5 January 2010 12:35
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] Clearing a directory with more than 60 million files

Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. Good thing
about ZFS is that it allows it without any issues. Unfortunatelly now that
we need to get rid of them (because they eat 80% of disk space) it seems
to be quite challenging.

Traditional approaches like find ./ -exec rm {} \; seem to take forever
- after running several days, the directory size still says the same. The
only way how I've been able to remove something has been by giving rm
-rf to problematic directory from parent level. Running this command
shows directory size decreasing by 10,000 files/hour, but this would still
mean close to ten months (over 250 days) to delete everything!

I also tried to use unlink command to directory as a root, as a user who
created the directory, by changing directory's owner to root and so forth,
but all attempts gave Not owner error.

Any commands like ls -f or find will run for hours (or days) without
actually listing anything from the directory, so I'm beginning to suspect
that maybe the directory's data structure is somewhat damaged. Is there
some diagnostics that I can run with e.g zdb to investigate and
hopefully fix for a single directory within zfs dataset?

To make things even more difficult, this directory is located in rootfs,
so dropping the zfs filesystem would basically mean reinstalling the
entire system, which is something that we really wouldn't wish to go.


OS is Solaris 10, zpool version is 10 (rather old, I know, but is there
easy path for upgrade that might solve this problem?) and the zpool
consists two 146 GB SAS drivers in a mirror setup.


Any help would be appreciated.

Thanks,
Mikko

-- 
 Mikko Lammi | l...@lmmz.net | http://www.lmmz.net

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Mike Gerdts
On Tue, Jan 5, 2010 at 4:34 AM, Mikko Lammi mikko.la...@lmmz.net wrote:
 Hello,

 As a result of one badly designed application running loose for some time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

 Traditional approaches like find ./ -exec rm {} \; seem to take forever
 - after running several days, the directory size still says the same. The
 only way how I've been able to remove something has been by giving rm
 -rf to problematic directory from parent level. Running this command
 shows directory size decreasing by 10,000 files/hour, but this would still
 mean close to ten months (over 250 days) to delete everything!

 I also tried to use unlink command to directory as a root, as a user who
 created the directory, by changing directory's owner to root and so forth,
 but all attempts gave Not owner error.

 Any commands like ls -f or find will run for hours (or days) without
 actually listing anything from the directory, so I'm beginning to suspect
 that maybe the directory's data structure is somewhat damaged. Is there
 some diagnostics that I can run with e.g zdb to investigate and
 hopefully fix for a single directory within zfs dataset?

In situations like this, ls will be exceptionally slow partially
because it will sort the output.  Find is slow because it needs to
call lstat() on every entry.  In similar situations I have found the
following to work.

perl -e 'opendir(D, "."); while ( $d = readdir(D) ) { print "$d\n" }'

Replace print with unlink if you wish...
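
A sketch of that unlink variant, skipping . and .. (run it from inside the
problem directory, and only once you are sure everything in it is disposable):

perl -e 'opendir(D, "."); while ( $d = readdir(D) ) { next if $d eq "." || $d eq ".."; unlink($d) or warn "$d: $!\n" }'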


 To make things even more difficult, this directory is located in rootfs,
 so dropping the zfs filesystem would basically mean reinstalling the
 entire system, which is something that we really wouldn't wish to go.


 OS is Solaris 10, zpool version is 10 (rather old, I know, but is there
 easy path for upgrade that might solve this problem?) and the zpool
 consists two 146 GB SAS drivers in a mirror setup.


 Any help would be appreciated.

 Thanks,
 Mikko

 --
  Mikko Lammi | l...@lmmz.net | http://www.lmmz.net

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Michael Schuster

Mike Gerdts wrote:

On Tue, Jan 5, 2010 at 4:34 AM, Mikko Lammi mikko.la...@lmmz.net wrote:

Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. Good thing
about ZFS is that it allows it without any issues. Unfortunatelly now that
we need to get rid of them (because they eat 80% of disk space) it seems
to be quite challenging.

Traditional approaches like find ./ -exec rm {} \; seem to take forever
- after running several days, the directory size still says the same. The
only way how I've been able to remove something has been by giving rm
-rf to problematic directory from parent level. Running this command
shows directory size decreasing by 10,000 files/hour, but this would still
mean close to ten months (over 250 days) to delete everything!

I also tried to use unlink command to directory as a root, as a user who
created the directory, by changing directory's owner to root and so forth,
but all attempts gave Not owner error.

Any commands like ls -f or find will run for hours (or days) without
actually listing anything from the directory, so I'm beginning to suspect
that maybe the directory's data structure is somewhat damaged. Is there
some diagnostics that I can run with e.g zdb to investigate and
hopefully fix for a single directory within zfs dataset?


In situations like this, ls will be exceptionally slow partially
because it will sort the output. 


that's what '-f' was supposed to avoid, I'd guess.

Michael
--
Michael Schuster        http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-05 Thread Joerg Schilling
Juergen Nickelsen n...@jnickelsen.de wrote:

 joerg.schill...@fokus.fraunhofer.de (Joerg Schilling) writes:

  The netapps patents contain claims on ideas that I invented for my Diploma 
  thesis work between 1989 and 1991, so the netapps patents only describe 
  prior
  art. The new ideas introduced with wofs include the ideas on how to use 
  COW
  for filesystems and on how to find the most recent superblock on a COW 
  filesystem. The ideas for the latter method were developed while 
  discussing the wofs structure with Carsten Bormann at TU-Berlin.

 Would you perhaps be willing to share the text? Sounds quite
 interesting, especially to compare it with ZFS and with Netapp's
 introduction to WAFL that I read a while ago.

If you are interested in the text, it is here:

http://cdrecord.berlios.de/private/wofs.ps.gz (the old original, without
images, as they were not created with troff)
http://cdrecord.berlios.de/private/WoFS.pdf (a reformatted version with
images included)

If you would like to see the program code, I am considering making it
available at some point.


 (And I know that discussions with Carsten Bormann can
 result in remarkable results -- not that I would want to disregard
 your own part in these ideas. :-)

Yes, he is a really helpful discussion partner.

As a note: the basic ideas for implementing COW (such as inverting the tree
structure to avoid rewriting all directories up to the root when a nested
file is updated, the idea of using generation nodes called G-nodes, and the
idea of how updated superblocks can be found) were invented by me. Carsten
helped to develop a method for defining and locating extension areas for
updated superblocks for the case when the primary superblock update area has
become full. The latter idea is not needed on a hard-disk-based filesystem,
as hard disks allow old superblock locations to be overwritten. On WORM media
it is essential, to make sure that the medium remains usable for writing as
long as there are unwritten blocks.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread David Magda
On Tue, January 5, 2010 05:34, Mikko Lammi wrote:

 As a result of one badly designed application running loose for some time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

How about creating a new data set, moving the directory into it, and then
destroying it?

Assuming the directory in question is /opt/MYapp/data:
  1. zfs create rpool/junk
  2. mv /opt/MYapp/data /rpool/junk/
  3. zfs destroy rpool/junk


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Casper . Dik

On Tue, January 5, 2010 05:34, Mikko Lammi wrote:

 As a result of one badly designed application running loose for some time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

How about creating a new data set, moving the directory into it, and then
destroying it?

Assuming the directory in question is /opt/MYapp/data:
  1. zfs create rpool/junk
  2. mv /opt/MYapp/data /rpool/junk/
  3. zfs destroy rpool/junk

The move will create and remove the files; the removal done by mv will be
just as inefficient, removing them one by one.

rm -rf would be at least as quick.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Mikko Lammi
On Tue, January 5, 2010 17:08, David Magda wrote:
 On Tue, January 5, 2010 05:34, Mikko Lammi wrote:

 As a result of one badly designed application running loose for some
 time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now
 that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

 How about creating a new data set, moving the directory into it, and then
 destroying it?

 Assuming the directory in question is /opt/MYapp/data:
   1. zfs create rpool/junk
   2. mv /opt/MYapp/data /rpool/junk/
   3. zfs destroy rpool/junk

Tried that as well. It moves individual files to the new directory at
roughly 3,000/minute, so it's no faster than anything I can apply directly
to the original directory.

I also tried the Perl script that does readdir() mentioned earlier (it's as
slow as anything else), and switched the dataset's atime property to off,
but that didn't have much effect either.

However, when we deleted some other files from the volume and managed to
raise free disk space from 4 GB to 10 GB, the "rm -rf directory" method
started to perform significantly faster. It's now deleting around 4,000
files/minute (240,000/hour - quite an improvement over 10,000/hour). I
remember seeing some discussion about ZFS performance when a filesystem
becomes very full, so I wonder if that was the case here.

Next I'm going to see whether "find ./ -exec rm {} +" yields any better
results than "rm -rf" from the parent directory. But I guess at some point
the bottleneck will just be CPU (this is a 1 GHz T1000 system) and disk I/O,
not the ZFS filesystem. I'm just wondering what kind of figures to expect.


regards,
Mikko

-- 
 Mikko Lammi | l...@lmmz.net | http://www.lmmz.net

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 10 and ZFS dedupe status

2010-01-05 Thread Bob Friesenhahn

On Mon, 4 Jan 2010, Tony Russell wrote:

I am under the impression that dedupe is still only in OpenSolaris 
and that support for dedupe is limited or non existent.  Is this 
true?  I would like to use ZFS and the dedupe capability to store 
multiple virtual machine images.  The problem is that this will be 
in a production environment and would probably call for Solaris 10 
instead of OpenSolaris.  Are my statements on this valid or am I off 
track?


If dedup gets scheduled for Solaris 10 (I don't know), it would surely 
not be available until at least a year from now.


Dedup in OpenSolaris still seems risky to use other than for 
experimental purposes.  It has only recently become available.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread David Magda
On Tue, January 5, 2010 10:12, casper@sun.com wrote:

How about creating a new data set, moving the directory into it, and then
destroying it?

Assuming the directory in question is /opt/MYapp/data:
  1. zfs create rpool/junk
  2. mv /opt/MYapp/data /rpool/junk/
  3. zfs destroy rpool/junk

 The move will create and remove the files; the remove by mv will be as
 inefficient removing them one by one.

 rm -rf would be at least as quick.

Normally, when you do a move within a 'regular' file system, all that usually
happens is that a directory pointer is shuffled around. Is that not the case
with ZFS datasets, even though they're in the same pool?


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Casper . Dik

On Tue, January 5, 2010 10:12, casper@sun.com wrote:

How about creating a new data set, moving the directory into it, and then
destroying it?

Assuming the directory in question is /opt/MYapp/data:
  1. zfs create rpool/junk
  2. mv /opt/MYapp/data /rpool/junk/
  3. zfs destroy rpool/junk

 The move will create and remove the files; the remove by mv will be as
 inefficient removing them one by one.

 rm -rf would be at least as quick.

Normally when you do a move with-in a 'regular' file system all that's
usually done is the directory pointer is shuffled around. This is not the
case with ZFS data sets, even though they're on the same pool?


You can only rename files within a single ZFS filesystem; within one zpool
but across different filesystems, you will need to copy and remove.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Dennis Clarke

 On Tue, January 5, 2010 10:12, casper@sun.com wrote:

How about creating a new data set, moving the directory into it, and
 then
destroying it?

Assuming the directory in question is /opt/MYapp/data:
  1. zfs create rpool/junk
  2. mv /opt/MYapp/data /rpool/junk/
  3. zfs destroy rpool/junk

 The move will create and remove the files; the remove by mv will be
 as
 inefficient removing them one by one.

 rm -rf would be at least as quick.

 Normally when you do a move with-in a 'regular' file system all that's
 usually done is the directory pointer is shuffled around. This is not the
 case with ZFS data sets, even though they're on the same pool?


You can also use star which may speed things up, safely.

star -copy -p -acl -sparse -dump -xdir -xdot -fs=96m -fifostats -time \
-C source_dir . destination_dir


That will buffer the transport of the data from source to destination via
memory and work to keep that buffer full as data is written on the output
side. It's probably at least as fast as mv, and probably safer, because you
never delete the original until after the copy is complete.


-- 
Dennis Clarke
dcla...@opensolaris.ca  - Email related to the open source Solaris
dcla...@blastwave.org   - Email related to open source for Solaris


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool destroy -f hangs system, now zpool import hangs system.

2010-01-05 Thread Carl Rathman
On Mon, Jan 4, 2010 at 8:59 PM, Richard Elling richard.ell...@gmail.com wrote:

 On Jan 4, 2010, at 6:40 AM, Carl Rathman wrote:

 I have a zpool raidz1 array (called storage) that I created under snv_118.

 I then created a zfs filesystem called storage/vmware which I shared
 out via iscsi.



 I then deleted the vmware filesystem, using 'zpool destroy -f
 storage/vmware' -- which resulted in heavy disk activity, and then
 hard locked the system after 10 minutes.

 I rebooted the machine, but was unable to boot. The machine would hang
 on Reading ZFS Configuration: - (the stick wouldn't even spin.)

 I was able to work around that by booting to a live CD, and deleting
 the zfs cache on my rpool.

 [clicked the wrong button]
 If you destroy the pool, then why try to import?
  -- richard

 zpool import sees my raidz1 array, but if I try 'zpool import -f
 storage', I get the same behavior of heavy disk activity for
 approximately 10 minutes, then a hard lock.


 Any clues on this one?

 Thanks,
 Carl
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



I didn't mean to destroy the pool.  I used zpool destroy on a zvol,
when I should have used zfs destroy.

When I used zpool destroy -f mypool/myvolume the machine hard locked
after about 20 minutes.

I don't want to destroy the pool, I just wanted to destroy the one
volume. -- Which is why I now want to import the pool itself. Does
that make sense?

Thanks,
Carl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread David Magda
On Tue, January 5, 2010 10:50, Michael Schuster wrote:
 David Magda wrote:
 Normally when you do a move with-in a 'regular' file system all that's
 usually done is the directory pointer is shuffled around. This is not
 the case with ZFS data sets, even though they're on the same pool?

 no - mv doesn't know about zpools, only about posix filesystems.

So the delineation of POSIX file systems is done at the data set layer,
and not at the zpool layer. (Which makes sense since the output of 'df'
tends to closely mimic the output of 'zfs list'.)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Roch

Richard Elling writes:
  On Jan 3, 2010, at 11:27 PM, matthew patton wrote:
  
   I find it baffling that RaidZ(2,3) was designed to split a record- 
   size block into N (N=# of member devices) pieces and send the  
   uselessly tiny requests to spinning rust when we know the massive  
   delays entailed in head seeks and rotational delay. The ZFS-mirror  
   and load-balanced configuration do the obviously correct thing and  
   don't split records and gain more by utilizing parallel access. I  
   can't imagine the code-path for RAIDZ would be so hard to fix.
  
  Knock yourself out :-)
  http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c
  
   I've read posts back to 06 and all I see are lamenting about the  
   horrendous drop in IOPs, about sizing RAIDZ to ~4+P and trying to  
   claw back performance by combining multiple such vDEVs. I understand  
   RAIDZ will never equal Mirroring, but it could get damn close if it  
   didn't break requests down and better yet utilized copies=N and  
   properly placed the copies on disparate spindles. This is somewhat  
   analogous to what the likes of 3PAR do and it's not rocket science.
  
  That is not the issue for small, random reads.  For all reads, the  
  checksum is
  verified. When you spread the record across multiple disks, then you  
  need
  to read the record back from those disks. In general, this means that as
  long as the recordsize is larger than the requested small read, then  
  your
  performance will approach the N/(N-P) * IOPS limit. At the  
  pathological edge,
  you can set recordsize to 512 bytes and you end up with mirroring (!)
  The small, random read performance model I developed only calculates
  the above IOPS limit, and does not consider recordsize.
  
  The physical I/O is much more difficult to correlate to the logical I/ 
  O because
  of all of the coalescing and caching that occurs at all of the lower  
  levels in
  the stack.
  
   An 8 disk mirror and a RAIDZ8+2P w/ copies=2 give me the same amount  
   of storage but the latter is a hell of a lot more resilient and max  
   IOPS should be higher to boot. An non-broken-up RAIDZ4+P would still  
   be 1/2 the IOPS of the 8 disk mirror but I'd at least save a bundle  
   of coin in either reduced spindle count or using slower drives.
  
   With all the great things ZFS is capable of, why hasn't this been  
   redesigned long ago? what glaringly obvious truth am I missing?
  
  Performance, dependability, space: pick two.
-- richard
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


If you store record X in one column, like RAID-5 or -6 does, then
you need to generate parity for that record X by grouping it
with other, unrelated records Y, Z, T, etc. When X is freed in the
filesystem, it still holds parity information protecting Y,
Z and T, so you can't get rid of what was stored at X. If you try
to store new data in X and in the associated parity but fail in
mid-stream, you hit the RAID-5 write hole. Moreover, now
that X is no longer referenced in the filesystem, no checksum is
associated with it any more, and if bit rot occurs in X and the disk
holding Y dies, resilvering would generate garbage for Y.

This seems to force us to chunk up disks with every unit
checksummed even when freed. Secure deletion becomes a problem
as well. And you can end up madly searching for free
stripes and repositioning old blocks in partial stripes even if
the pool is only 10% full.

Can one do this with RAID-DP?
http://blogs.sun.com/roch/entry/need_inodes


That said, I am truly for an evolution for random-read
workloads. RAID-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored, with good random-read
performance, while large objects are stored efficiently.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 2:34 AM, Mikko Lammi wrote:


Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. Good thing
about ZFS is that it allows it without any issues. Unfortunatelly now that
we need to get rid of them (because they eat 80% of disk space) it seems
to be quite challenging.

Traditional approaches like find ./ -exec rm {} \; seem to take forever
- after running several days, the directory size still says the same. The
only way how I've been able to remove something has been by giving rm
-rf to problematic directory from parent level. Running this command
shows directory size decreasing by 10,000 files/hour, but this would still
mean close to ten months (over 250 days) to delete everything!


This is, in part, due to stat() slowness.  Fixed in later OpenSolaris  
builds.

I have no idea if or when the fix will be backported to Solaris 10.

I also tried to use unlink command to directory as a root, as a user who
created the directory, by changing directory's owner to root and so forth,
but all attempts gave Not owner error.

Any commands like ls -f or find will run for hours (or days) without
actually listing anything from the directory, so I'm beginning to suspect
that maybe the directory's data structure is somewhat damaged. Is there
some diagnostics that I can run with e.g zdb to investigate and
hopefully fix for a single directory within zfs dataset?

To make things even more difficult, this directory is located in rootfs,
so dropping the zfs filesystem would basically mean reinstalling the
entire system, which is something that we really wouldn't wish to go.


How are the files named?  If you know something about the filename
pattern, then you could create subdirs and mv large numbers of files
to reduce the overall size of a single directory.  Something like:

mkdir .A
mv A* .A
mkdir .B
mv B* .B
...
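
If the names really do start with a small set of predictable prefixes, a rough
sketch of automating that split is below; note that the shell glob has to
expand every matching name in memory, which may be slow or hit ARG_MAX at
this scale, so treat it as an experiment rather than a sure win:

# prefixes are a guess; run from inside the problem directory
for p in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z; do
    mkdir ".$p"
    mv "$p"* ".$p"/ 2>/dev/null
done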

Also, as previously noted, atime=off.

If you can handle a reboot, you can bump the size of the DNLC, which
might help also.  OTOH, if you can reboot you can also run the latest
b130 livecd which has faster stat().
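
For reference, the DNLC size is controlled by the ncsize tunable in
/etc/system; a sketch (the value is purely illustrative, and the change only
takes effect after the reboot):

* /etc/system -- illustrative value only
set ncsize=1000000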
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Joerg Schilling
Michael Schuster michael.schus...@sun.com wrote:

  rm -rf would be at least as quick.
  
  Normally when you do a move with-in a 'regular' file system all that's
  usually done is the directory pointer is shuffled around. This is not the
  case with ZFS data sets, even though they're on the same pool?

 no - mv doesn't know about zpools, only about posix filesystems.

mv first tries to rename(2) the file. If this does not succeed but results in 
EXDEV, it copies the file.
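
In other words (paths made up for illustration):

# same dataset: rename(2) succeeds, nothing is copied
mv /tank/fs1/bigfile /tank/fs1/archive/bigfile
# different dataset in the same pool: rename(2) fails with EXDEV,
# so mv copies the data and then unlinks the source
mv /tank/fs1/bigfile /tank/fs2/bigfile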

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can't export pool after zfs receive

2010-01-05 Thread David Dyer-Bennet

On Mon, January 4, 2010 13:51, Ross wrote:
 I initialized a new whole-disk pool on an external
 USB drive, and then did zfs send from my big data pool and zfs recv onto
 the
 new external pool.
 Sometimes this fails, but this time it completed.

 That's the key bit for me - zfs send /receive should not just fail at
 random.  It sounds like your problem is not just that you can't export the
 pool.

It's equally flaky in virtual environments, in my experience.  Sadly. 
Send / receive seems to not be ready for prime time yet.  (I had to give
up on incremental completely, since that was erroring out.)

 As Richard says, that sounds like bad hardware / drivers.  Something is
 causing problems for ZFS.

Always possible.  I paid twice or so what I usually pay for systems to buy
this one, including ECC memory and such, but none of that is any guarantee
that I got it right.  Frustrating, though.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Casper . Dik

no - mv doesn't know about zpools, only about posix filesystems.

mv doesn't care about filesystems, only about the interface provided by
POSIX.

There is no ZFS-specific interface that allows you to move a file from
one ZFS filesystem to the next.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool destroy -f hangs system, now zpool import hangs system.

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 7:54 AM, Carl Rathman wrote:


I didn't mean to destroy the pool.  I used zpool destroy on a zvol,
when I should have used zfs destroy.

When I used zpool destroy -f mypool/myvolume the machine hard locked
after about 20 minutes.


This would be a bug.  zpool destroy should only destroy pools.
Volumes are datasets and are destroyed by zfs destroy.  Using
zpool destroy -f will attempt to force unmounts of any mounted
datasets, but volumes are not mounted, per se. Upon reboot, nothing
will be mounted until after the pool is imported.



I don't want to destroy the pool, I just wanted to destroy the one
volume. -- Which is why I now want to import the pool itself. Does
that make sense?


If the pool was destroyed, then you can try to import using -D.

Are you sure you didn't zfs destroy instead?  Once the pool is  
imported,

zpool history will show all of the commands issued against the pool.
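
A sketch of that sequence, using the pool name from the original post (only
applicable if the pool really was destroyed):

zpool import -D             # list destroyed pools that are still importable
zpool import -D storage     # attempt to re-import the destroyed pool by name
zpool history storage       # then review what was actually run against it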
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread David Dyer-Bennet

On Tue, January 5, 2010 10:01, Richard Elling wrote:
 OTOH, if you can reboot you can also run the latest
 b130 livecd which has faster stat().

How much faster is it?  He estimated 250 days to rm -rf them; so 10x
faster would get that down to 25 days, 100x would get it down to 2.5 days
(assuming the entire time is in the stat calls, which is probably not
totally true)

It's interesting how our ability to build larger disks, and our software's
ability to do things like create really large numbers of files, comes back
to bite us on the ass every now and then.

I hope he has a background process running chipping away at it; I don't
THINK 250 days in the background is going to turn out to be the best
answer, but one might as well start the clock running just in case.

Best answer might turn out to be to copy off the less than 20% good data
and just scrag the pool.  Inelegant, but might result in less downtime, or
in getting the space back much faster.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 10 and ZFS dedupe status

2010-01-05 Thread Henrik Johansson

On Jan 5, 2010, at 4:38 PM, Bob Friesenhahn wrote:

 On Mon, 4 Jan 2010, Tony Russell wrote:
 
 I am under the impression that dedupe is still only in OpenSolaris and that 
 support for dedupe is limited or non existent.  Is this true?  I would like 
 to use ZFS and the dedupe capability to store multiple virtual machine 
 images.  The problem is that this will be in a production environment and 
 would probably call for Solaris 10 instead of OpenSolaris.  Are my 
 statements on this valid or am I off track?
 
 If dedup gets scheduled for Solaris 10 (I don't know), it would surely not be 
 available until at least a year from now.
 
 Dedup in OpenSolaris still seems risky to use other than for experimental 
 purposes.  It has only recently become available.

I've just written an entry about update 9; I think it will contain zpool
version 19, so no dedup for this release, if that's correct.

Regards

Henrik
http://sparcv9.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz stripe size (not stripe width)

2010-01-05 Thread Kjetil Torgrim Homme
Brad bene...@yahoo.com writes:

 Hi Adam,

I'm not Adam, but I'll take a stab at it anyway.

BTW, your crossposting is a bit confusing to follow, at least when using
gmane.org.  I think it is better to stick to one mailing list anyway?

 From your the picture, it looks like the data is distributed evenly
 (with the exception of parity) across each spindle then wrapping
 around again (final 4K) - is this one single write operation or two?

It is a single write operation per device. Actually, it may be less than
one write operation, since the transaction group, which probably contains
many more updates, is written as a whole.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 8:13 AM, David Dyer-Bennet wrote:


On Tue, January 5, 2010 10:01, Richard Elling wrote:

OTOH, if you can reboot you can also run the latest
b130 livecd which has faster stat().


How much faster is it?  He estimated 250 days to rm -rf them; so 10x
faster would get that down to 25 days, 100x would get it down to 2.5 days
(assuming the entire time is in the stat calls, which is probably not
totally true)


dunno, nothing useful in the public bug report :-(
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6775100

It's interesting how our ability to build larger disks, and our software's
ability to do things like create really large numbers of files, comes back
to bite us on the ass every now and then.


Wait until you try it with dedup... not only will you need to update a lot
of metadata, but also a lot of DDT entries.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool destroy -f hangs system, now zpool import hangs system.

2010-01-05 Thread Carl Rathman
On Tue, Jan 5, 2010 at 10:12 AM, Richard Elling
richard.ell...@gmail.com wrote:
 On Jan 5, 2010, at 7:54 AM, Carl Rathman wrote:

 I didn't mean to destroy the pool.  I used zpool destroy on a zvol,
 when I should have used zfs destroy.

 When I used zpool destroy -f mypool/myvolume the machine hard locked
 after about 20 minutes.

 This would be a bug.  zpool destroy should only destroy pools.
 Volumes are datasets and are destroyed by zfs destroy.  Using
 zpool destroy -f will attempt to force unmounts of any mounted
 datasets, but volumes are not mounted, per se. Upon reboot, nothing
 will be mounted until after the pool is imported.


 I don't want to destroy the pool, I just wanted to destroy the one
 volume. -- Which is why I now want to import the pool itself. Does
 that make sense?

 If the pool was destroyed, then you can try to import using -D.

 Are you sure you didn't zfs destroy instead?  Once the pool is imported,
 zpool history will show all of the commands issued against the pool.
  -- richard



Hi Richard,

If I could import the pool, I'd love to do a history on it.

At this point, if I attempt to import the pool, the machine will have
heavy disk activity on the pool for approximately 10 minutes, then the
machine will hard lock. This will happen when I boot the machine from
its snv_130 rpool, or if I boot the machine from a snv_130 live cd.

Thanks,
Carl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recovering ZFS stops after syseventconfd can't fork

2010-01-05 Thread Cindy Swearingen

Hi Paul,

I opened 6914208 to cover the sysevent/zfsdle problem.

If the system crashed due to a power failure and the disk labels for
this pool were corrupted, then I think you will need to follow the steps 
to get the disks relabeled correctly. You might review some previous 
postings by Victor Latuskin that describe these steps.


Thanks,

Cindy

On 12/28/09 11:17, Paul Armstrong wrote:

Alas, even moving the file out of the way and rebooting the box (to guarantee 
state) didn't work:

-bash-4.0# zpool import -nfFX hds1
echo $?
-bash-4.0# echo $?
1

Do you need to be able to read all the labels for each disk in the array in 
order to recover?


From zdb -l on one of the disks:


LABEL 3

failed to unpack label 3

Thanks,
Paul

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Tim Cook
On Tue, Jan 5, 2010 at 11:25 AM, Richard Elling richard.ell...@gmail.comwrote:

 On Jan 5, 2010, at 8:13 AM, David Dyer-Bennet wrote:


 On Tue, January 5, 2010 10:01, Richard Elling wrote:

 OTOH, if you can reboot you can also run the latest
 b130 livecd which has faster stat().


 How much faster is it?  He estimated 250 days to rm -rf them; so 10x
 faster would get that down to 25 days, 100x would get it down to 2.5 days
 (assuming the entire time is in the stat calls, which is probably not
 totally true)


 dunno, nothing useful in the public bug report :-(
 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6775100


  It's interesting how our ability to build larger disks, and our software's
 ability to do things like create really large numbers of files, comes back
 to bite us on the ass every now and then.


 Wait until you try it with dedup... not only will you need to update a lot
 of metadata, but also a lot of DTT entries.
  -- richard



I recall pointing this out over a year ago when I said claiming unlimited
snapshots and filesystems was disingenuous at best, and that likely we'd
need to see artificial limitations to make many of these features usable.
But I digress :)

-- 
--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 16:00, Roch wrote:

That said, I truly am for a evolution for random read
workloads. Raid-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored with good random read
performance while large objects are stored efficiently.

   


Have you got any benchmarks available (comparing 512B to 4K to classical 
RAID-5)?


The problem is that while RAID-Z is really good for some workloads, it is
really bad for others. Sometimes an L2ARC can effectively mitigate the
problem, but for some workloads it won't (due to the huge size of the working
set). In such environments RAID-Z2 offers much worse performance than a
similarly configured NetApp (RAID-DP, same number of disk drives). If ZFS
provided another RAID-5/RAID-6-like protection scheme with different
characteristics, so that writing to the pool were slower but reading from it
much faster (comparable to RAID-DP), some customers would be very happy. Then
maybe a new kind of cache device would be needed to buffer writes to NV
storage to make writes faster (as HW arrays have been doing for years).



A possible *workaround* is to use SVM to set-up RAID-5 and create a zfs 
pool on top of it.
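
For concreteness, a very rough sketch of that layering (device names are
placeholders, and the usual SVM caveats about state database replicas and the
lack of RAID-6 apply; note too that ZFS on a single metadevice can detect but
not repair checksum errors unless copies is raised):

metadb -a -f -c 2 c0t0d0s7 c0t1d0s7                           # state database replicas
metainit d20 -r c1t1d0s0 c1t2d0s0 c1t3d0s0 c1t4d0s0 -i 64k    # SVM RAID-5 metadevice
zpool create tank /dev/md/dsk/d20                             # pool on the metadevice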

How does SVM handle R5 write hole? IIRC SVM doesn't offer RAID-6.

--
Robert Milkowski
http://milek.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Daniel Rock

On 05.01.2010 16:22, Mikko Lammi wrote:

However when we deleted some other files from the volume and managed to
raise free disk space from 4 GB to 10 GB, the rm -rf directory method
started to perform significantly faster. Now it's deleting around 4,000
files/minute (240,000/h - quite an improvement from 10,000/h). I remember
that I saw some discussion related to ZFS performance when filesystem
becomes very full, so I wonder if that was the case here.


I did some tests on an Ultra 20 (2.2 GHz dual-core Opteron) with crappy SATA
disks. On this machine, creation and deletion of files were I/O bound. I was
able to create about 1 million files per hour. I stopped after 5 hours, so I
had approximately 5 million files in one directory.

Deletion (via the Perl script) also ran at a rate of about 1 million files
per hour. During deletion both disks (a mirrored zpool) were 95% busy; CPU
time was 5% in total.


If the T1000 has SCSI disks you can turn on write cache on both disks 
(though in my tests on delete most I/O were read operations). For the rpool 
it will probably not be enabled by default because you are just using 
partitions:


# format -e
[select disk]
format> scsi
scsi> p8 b2 |= 4

Mode select on page 8 ok.

scsi> quit

Disable write cache:

scsi> p8 b2 &= ~4


(Yes I know, there is a cache command in format, but I was used to the above
commands a long time before the cache command was introduced)


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Joe Blount

On 01/ 5/10 10:01 AM, Richard Elling wrote:

How are the files named?  If you know something about the filename
pattern, then you could create subdirs and mv large numbers of files
to reduce the overall size of a single directory.  Something like:

mkdir .A
mv A* .A
mkdir .B
mv B* .B
...

I doubt that would be a faster option, unless you can be certain the 
file naming coincides with the unsorted order of the files in the 
directory.  If A* does not occur at the beginning of the 
directory's contents, finding those files will be painful.  The above process 
would add many cycles of scanning through all 60 million directory 
entries, and each move request will churn the vnode cache.



A while back I did some experimenting with millions of files per 
directory.  Note that the time estimates are overstating how long it 
will take.  The more files you remove, the faster it will go.  I would 
be trying to get an unsorted read of the directory, and delete them in 
that order.  This is not just to save the time it takes to sort the 
output.  It will also minimize vnode cache churn, and the time to remove 
each object.  Each remove request must iterate the directory looking for 
the object to remove. 

Newer ON builds support the -U option to ls, for unsorted output.  I 
don't know what may exist on S10.  FWIW, I copied the 'ls' binary from an 
ON128 machine to /tmp/myls on a S10 machine, and it appeared to work - I 
don't know if there are any issues/risks with doing that.



Since it's a Niagara system, it might go faster if you can get multiple 
removes going in parallel - but only if all the parallel remove 
requests can be on files near the beginning of the directory's 
contents.  If you can't get an unsorted list of files, then multiple 
threads will just add to the vnode cache thrashing.



It might be worth trying something like this:
ls -U > remove.sh
Make it a bash script:
prepend rm -f to each line, and append an & to each line.
Maybe every few hundred lines put in a wait.  (In case the rm's can be 
kicked off significantly faster than they can be completed, you don't 
want millions of rm's to get started.)


You'll have to wait on ls to do one unsorted read of the directory.  
Then you will get parallel remove requests going, and always on files at 
the beginning of the directory.  There should be minimal vnode churn 
during the removes.
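
Roughly, that generation step could look like this (the paths and the batch
size of 200 are just examples; /tmp/myls is the copied ls from above):

# Turn an unsorted listing into a script of backgrounded rm's, with a
# wait after every 200 so that millions of rm processes never pile up.
cd /path/to/bigdir && /tmp/myls -U . | \
  awk '{ printf "rm -f \"%s\" &\n", $0;
         if (NR % 200 == 0) print "wait" }
       END { print "wait" }' > /var/tmp/remove.sh
sh /var/tmp/remove.sh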


Starting the new processes for the removes may counteract the benefit of 
parallelizing, and make this slower.  But since it's a Niagara system, 
you may have the spare cpu cycles to waste anyway.  It's just another 
idea to try...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Paul Gress

On 01/ 5/10 05:34 AM, Mikko Lammi wrote:

Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. Good thing
about ZFS is that it allows it without any issues. Unfortunately now that
we need to get rid of them (because they eat 80% of disk space) it seems
to be quite challenging.
   


I've been following this thread.  Would it be faster to do the reverse:  
copy the 20% of the disk that's in use, then format, then move the 20% back?


Paul
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Michael Schuster

Paul Gress wrote:

On 01/ 5/10 05:34 AM, Mikko Lammi wrote:

Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. Good thing
about ZFS is that it allows it without any issues. Unfortunately now that
we need to get rid of them (because they eat 80% of disk space) it seems
to be quite challenging.
  


I've been following this thread.  Would it be faster to do the reverse.  
Copy the 20% of disk then format then move the 20% back.


I'm not sure the OS installation would survive that.

Michael
--
Michael Schusterhttp://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Fajar A. Nugraha
On Wed, Jan 6, 2010 at 12:44 AM, Michael Schuster
michael.schus...@sun.com wrote:
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.


 I've been following this thread.  Would it be faster to do the reverse.
  Copy the 20% of disk then format then move the 20% back.

 I'm not sure the OS installation would survive that.

... even when done from a live/rescue CD session?

-- 
Fajar
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import -f not forceful enough?

2010-01-05 Thread Cindy Swearingen

Hi Dan,

Can you describe what you are trying to recover from with more details
because we can't quite follow what steps might have lead to this
scenario.

For example, your hdc pool had two disks, c1t0d0s0 and c8t1d0s0,
and your rpool is on c8t0d0s0, so c8t0d0s0 cannot be wiped clean.

Maybe you mean c8t1d0s0 is now relabeled (with labelfix) and
reconnected for hdc but c1t0d0s0 is still corrupted so the hdc pool
cannot be re-imported is my guess...

Thanks,

Cindy


On 01/03/10 14:49, Dan McDonald wrote:

I had to use the labelfix hack (and I had to recompile it at that) on 1/2 of an 
old zpool.  I made this change:

/* zio_checksum(ZIO_CHECKSUM_LABEL, zc, buf, size); */
zio_checksum_table[ZIO_CHECKSUM_LABEL].ci_func[0](buf, size, zc);

and I'm assuming [0] is the correct endianness, since afterwards I saw it come up with 
zpool import.

Unfortunately, I can't import it.  Here's what happens:

# uname -a
SunOS neuromancer 5.11 snv_130 i86pc i386 i86pc
# zpool import
  pool: hdc
id: 18323387294498987089
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

hdc   FAULTED  corrupted data
  mirror-0DEGRADED
c1t0d0s0  FAULTED  corrupted data
c8t1d0s0  ONLINE
# zpool import -f hdc
cannot import 'hdc': one or more devices is currently unavailable
Destroy and re-create the pool from
a backup source.
# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
rpool   ONLINE   0 0 0
  c8t0d0s0  ONLINE   0 0 0

errors: No known data errors
#

Note that c1t0d0s0 was on the old system, and that it's now the (wiped clean) 
c8t0d0s0.  Any clues are, as always, welcome.  I'd prefer not to restore my 
saved zfs-send streams, so I'd like to get the import of the old root pool 
(hdc) to work.

Thanks!
Dan McD.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import -f not forceful enough?

2010-01-05 Thread Dan McDonald
 Hi Dan,
 
 Can you describe what you are trying to recover from
 with more details
 because we can't quite follow what steps might have
 lead to this
 scenario.

Sorry.

I was running Nevada 103 with a root zpool called hdc with c1t0d0s0 and 
c1t1d0s0.

I first uttered:  zpool detach hdc c1t1d0s0.

I then detached that drive, installed OpenSolaris on what *was* c1t0d0s0.  Once 
I rebooted into OpenSolaris, I noticed that the drive had become c8t0d0s0.

Assuming the same remapping happened to the other drive, I plugged it back in, 
ran labelfix on c8t1d0s0, and now got to see that pool hdc assumes the mirror 
is c1t0d0s0 and c8t1d0s0.  I have no idea what c1t0d0s0 points to on the new 
OpenSolaris view of things, but it is definitely corrupt from the point of view 
of a ZFS mirrored pool.

So it sounds like I cannot whack c8t1d0s0 to be a pool all by itself and I 
should just give up.  Is that correct?

Thanks,
Dan
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread A Darren Dunham
On Tue, Jan 05, 2010 at 04:49:00PM +, Robert Milkowski wrote:
 A possible *workaround* is to use SVM to set-up RAID-5 and create a
 zfs pool on top of it.
 How does SVM handle R5 write hole? IIRC SVM doesn't offer RAID-6.

As far as I know, it does not address it.  It's possible that adding a
transaction volume would help by replaying anything that affected the
volume, but I don't know that sufficient information is present.

Symantec Volume Manager offers an explicit Raid5 log device.  There
doesn't appear to be any corresponding object in SVM.

-- 
Darren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Roch Bourbonnais


Le 5 janv. 10 à 17:49, Robert Milkowski a écrit :


On 05/01/2010 16:00, Roch wrote:

That said, I truly am for a evolution for random read
workloads. Raid-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored with good random read
performance while large objects are stored efficiently.




Have you got any benchmarks available (comparing 512B to 4K to  
classical RAID-5)?


Using an 8K 'soft' sector prototype on an otherwise plain raid-z layout, 
we got 8X more random reads than with 512B sectors, as would be 
expected.




The problem is that while RAID-Z is really good for some workloads  
it is really bad for others.


The bigger sector makes raid-z behave like mirroring for small records, so 
the performance of raid-z will be very good, and it is also space 
efficient for large objects.


Sometimes having L2ARC might effectively mitigate the problem but  
for some workloads it won't (due to the huge size of a working set).  
In such environments RAID-Z2 offers much worse performance then  
similarly configured NetApp (RAID-DP, same number of disk drives).  
If ZFS would provide another RAID-5/RAID-6 like protection but with  
different characteristics so writing to a pool would be slower but  
reading from it would be much faster (comparable to RAID-DP) some  
customers would be very happy.


Agreed.

Then maybe a new kind of cache device would be needed to buffer  
writes to NV storage to make writes faster (like HW arrays have  
been doing for years).




Writes are not the problem, and we have log devices to offload them.
It's really about maintaining the integrity of a raid-5 type layout in the
presence of bit-rot, even if such bit-rot occurs within free space.



A possible *workaround* is to use SVM to set-up RAID-5 and create a  
zfs pool on top of it.

How does SVM handle R5 write hole? IIRC SVM doesn't offer RAID-6.



It doesn't.



--
Robert Milkowski
http://milek.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 8:49 AM, Robert Milkowski wrote:


On 05/01/2010 16:00, Roch wrote:

That said, I truly am for a evolution for random read
workloads. Raid-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored with good random read
performance while large objects are stored efficiently.




Have you got any benchmarks available (comparing 512B to 4K to  
classical RAID-5)?


Not fair!  A 512 byte random write workload will absolutely clobber a
RAID-5 implementation. It is the RAID-5 pathological worst case.
For many arrays, even a 4 KB random write workload will suck most
heinously.

The raidz pathological worst case is a random read from many-column
raidz where files have records 128 KB in size.  The inflated read problem
is why it makes sense to match recordsize for fixed record workloads.
This includes CIFS workloads which use 4 KB records. It is also why
having many columns in the raidz for large records does not improve
performance. Hence the 3 to 9 raidz disk limit recommendation in the
zpool man page.
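
For example, for a fixed 4 KB record workload you would set the recordsize
on the dataset before populating it (the dataset name is only an example):

# Match recordsize to the application's record size up front; the
# property only takes effect for files created after it is set.
zfs set recordsize=4k tank/cifs
zfs get recordsize tank/cifs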

http://www.baarf.com

The problem is that while RAID-Z is really good for some workloads  
it is really bad for others.
Sometimes having L2ARC might effectively mitigate the problem but  
for some workloads it won't (due to the huge size of a working set).  
In such environments RAID-Z2 offers much worse performance then  
similarly configured NetApp (RAID-DP, same number of disk drives).  
If ZFS would provide another RAID-5/RAID-6 like protection but with  
different characteristics so writing to a pool would be slower but  
reading from it would be much faster (comparable to RAID-DP) some  
customers would be very happy. Then maybe a new kind of cache device  
would be needed to buffer writes to NV storage to make writes faster  
(like HW arrays have been doing for years).


This still does not address the record checksum.  This is only a problem
for small, random read workloads, which means L2ARC is a good solution.
If L2ARC is a set of HDDs, then you could gain some advantage, but IMHO
HDD and good performance do not belong in the same sentence anymore.
Game over -- SSDs win.

A possible *workaround* is to use SVM to set-up RAID-5 and create a  
zfs pool on top of it.

How does SVM handle R5 write hole? IIRC SVM doesn't offer RAID-6.


IIRC, SVM does a prewrite.  Dog slow.  Also, SVM is, AFAICT, on life support.
The source is out there if anyone wants to carry it forward. Actually, many of us
would be quite happy for SVM to fade from our memory :-)
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz stripe size (not stripe width)

2010-01-05 Thread Richard Elling


On Jan 4, 2010, at 7:08 PM, Brad wrote:


Hi Adam,

From your the picture, it looks like the data is distributed evenly  
(with the exception of parity) across each spindle then wrapping  
around again (final 4K) - is this one single write operation or two?


| P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 | - 
one write op??
| P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 | - 
one write op??


One physical write op per vdev because the columns will likely be
coalesced at the vdev.  Obviously, one physical write cannot span
multiple vdevs.


For a stripe configuration, is this would it would like look for 8K?

| D00 D01 D02 D03 D04 D05 D06 D07 D08 |
| D09 D10 D11 D12 D13 D14 D15 D16 D17 |


No.  It is very likely the entire write will be to one vdev.  Again, this is
dynamic striping, not RAID-0. RAID-0 is defined by SNIA as "A disk array
data mapping technique in which fixed-length sequences of virtual disk
data addresses are mapped to sequences of member disk addresses
in a regular rotating pattern."  In ZFS, there is no fixed-length sequence.

The next column is chosen approximately every MB or so. You get the
benefit of sequential access to the media, with the stochastic spreading
across vdevs as well.

When you have multiple top-level vdevs, such as multiple mirrors or
multiple raidz sets, then you get the ~ 1MB spread across the top level
and the normal allocations in the sets.  In other words, any given record
should be in one set.  Again, this limits hyperspreading and allows you
to scale to very large numbers of disks.  It seems to work reasonably
well in practice. I attempted to describe this in pictures for my ZFS
tutorials.  You can be the judge, and suggestions are always welcome.
See slide 27 at
http://www.slideshare.net/relling/zfs-tutorial-usenix-lisa09-conference
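
As a sketch, a pool with two top-level raidz vdevs looks like this
(device names are placeholders):

# Two top-level raidz vdevs; allocations spread across both sets in
# roughly 1MB runs, and any given record stays within one set.
zpool create tank \
    raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
    raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0
zpool status tank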

[for the alias, I've only today succeeded in uploading the slides to
slideshare... been trying off and on for more than a month :-(]
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Richard Elling


On Jan 5, 2010, at 8:52 AM, Daniel Rock wrote:


Am 05.01.2010 16:22, schrieb Mikko Lammi:
However when we deleted some other files from the volume and  
managed to
raise free disk space from 4 GB to 10 GB, the rm -rf directory  
method
started to perform significantly faster. Now it's deleting around  
4,000
files/minute (240,000/h - quite an improvement from 10,000/h). I  
remember

that I saw some discussion related to ZFS performance when filesystem
becomes very full, so I wonder if that was the case here.


I did some tests. They were done on an Ultra 20 (2.2 GHz Dual-Core  
Opteron) with crappy SATA disks. On this machine creation and  
deletion of files were I/O bound. I was able to create about 1 Mio.  
files per hour. I stopped after 5 hours, so I had approx. 5 Mio.  
files in one directory.


Deletion (via the Perl script) also had a rate of ~1 Mio. files per  
hour. During deletion the disks (mirrored zpool) were both 95% busy,  
CPU time was 5% total.


If the T1000 has SCSI disks you can turn on write cache on both  
disks (though in my tests on delete most I/O were read operations).  
For the rpool it will probably not be enabled by default because  
your are just using partitions:


Good observation!  By default, rpool will not have write cache enabled.
It might make a difference to enable the write cache for this operation.
 -- richard



# format -e
[select disk]
format> scsi
scsi> p8 b2 |= 4

Mode select on page 8 ok.

scsi> quit

Disable write cache:

scsi> p8 b2 &= ~4


(Yes I know, there is a cache command in format, but I'm used to  
above

commands a long time before the cache command was introduced)


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 18:37, Roch Bourbonnais wrote:


Writes are not the problem and we have log device to offload them. 
It's really about maintaining integrity of raid-5 type layout in the 
presence of bit-rot even if such

bit-rot occur within free space.



How is it addressed in RAID-DP?


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 18:49, Richard Elling wrote:
On Jan 5, 2010, at 8:49 AM, Robert Milkowski wrote: 


The problem is that while RAID-Z is really good for some workloads it 
is really bad for others.
Sometimes having L2ARC might effectively mitigate the problem but for 
some workloads it won't (due to the huge size of a working set). In 
such environments RAID-Z2 offers much worse performance then 
similarly configured NetApp (RAID-DP, same number of disk drives). If 
ZFS would provide another RAID-5/RAID-6 like protection but with 
different characteristics so writing to a pool would be slower but 
reading from it would be much faster (comparable to RAID-DP) some 
customers would be very happy. Then maybe a new kind of cache device 
would be needed to buffer writes to NV storage to make writes faster 
(like HW arrays have been doing for years).


This still does not address the record checksum.  This is only a problem
for small, random read workloads, which means L2ARC is a good solution.
If L2ARC is a set of HDDs, then you could gain some advantage, but IMHO
HDD and good performance do not belong in the same sentence anymore.
Game over -- SSDs win.



as I wrote - sometimes the working set is so big that, L2ARC or not, there 
is virtually no difference, and it is not practical to deploy an L2ARC 
several TBs in size or bigger. For such a workload RAID-DP behaves much 
better (many small random reads, not that many writes).



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread David Dyer-Bennet

On Tue, January 5, 2010 10:25, Richard Elling wrote:
 On Jan 5, 2010, at 8:13 AM, David Dyer-Bennet wrote:

 It's interesting how our ability to build larger disks, and our
 software's
 ability to do things like create really large numbers of files,
 comes back
 to bite us on the ass every now and then.

 Wait until you try it with dedup... not only will you need to update a
 lot
 of metadata, but also a lot of DTT entries.

My data consists (by volume) almost entirely of bitmap photo images; I
don't think dedup is going to buy me much, so I'm not leaping into
experimenting with it.

Probably just as well; I don't think I have enough memory for it, either.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Tristan Ball

On 6/01/2010 3:00 AM, Roch wrote:

Richard Elling writes:
On Jan 3, 2010, at 11:27 PM, matthew patton wrote:
  
  I find it baffling that RaidZ(2,3) was designed to split a record-
  size block into N (N=# of member devices) pieces and send the
  uselessly tiny requests to spinning rust when we know the massive
  delays entailed in head seeks and rotational delay. The ZFS-mirror
  and load-balanced configuration do the obviously correct thing and
  don't split records and gain more by utilizing parallel access. I
  can't imagine the code-path for RAIDZ would be so hard to fix.
  
Knock yourself out :-)

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c
  
  I've read posts back to 06 and all I see are lamenting about the
  horrendous drop in IOPs, about sizing RAIDZ to ~4+P and trying to
  claw back performance by combining multiple such vDEVs. I understand
  RAIDZ will never equal Mirroring, but it could get damn close if it
  didn't break requests down and better yet utilized copies=N and
  properly placed the copies on disparate spindles. This is somewhat
  analogous to what the likes of 3PAR do and it's not rocket science.
  

   

[snipped for space ]


That said, I truly am for a evolution for random read
workloads. Raid-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored with good random read
performance while large objects are stored efficiently.

-r
   

Sold! Let's do that then! :-)

Seriously - are there design or architectural reasons why this isn't 
done by default, or at least an option? Or is it just a "no one's had 
time to implement yet" thing?
I understand that 4K sectors might be less space efficient for lots of 
small files, but I suspect lots of us would happily make that trade off!


Thanks,
Tristan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 11:30 AM, Robert Milkowski wrote:

On 05/01/2010 18:49, Richard Elling wrote:

On Jan 5, 2010, at 8:49 AM, Robert Milkowski wrote:


The problem is that while RAID-Z is really good for some workloads  
it is really bad for others.
Sometimes having L2ARC might effectively mitigate the problem but  
for some workloads it won't (due to the huge size of a working  
set). In such environments RAID-Z2 offers much worse performance  
then similarly configured NetApp (RAID-DP, same number of disk  
drives). If ZFS would provide another RAID-5/RAID-6 like  
protection but with different characteristics so writing to a pool  
would be slower but reading from it would be much faster  
(comparable to RAID-DP) some customers would be very happy. Then  
maybe a new kind of cache device would be needed to buffer writes  
to NV storage to make writes faster (like HW arrays have been  
doing for years).


This still does not address the record checksum.  This is only a  
problem
for small, random read workloads, which means L2ARC is a good  
solution.
If L2ARC is a set of HDDs, then you could gain some advantage, but  
IMHO

HDD and good performance do not belong in the same sentence anymore.
Game over -- SSDs win.



as I wrote - sometimes the working set is so big that L2ARC or not  
there is virtually no difference and it is not practical to deploy  
L2ARC several TBs in size or bigger. For such workload RAID-DP  
behaves much better (many small random reads, not that much writes).


If you are doing small, random reads on dozens of TB of data, then you've
got a much bigger problem on your hands... kinda like counting grains of
sand on the beach during low tide :-).  Hopefully, you do not have to randomly
update that data because your file system isn't COW :-). Fortunately, most
workloads are not of that size and scope.

Since there are already 1 TB SSDs on the market, the only thing keeping the
HDD market alive is the low $/TB.  Moore's Law predicts that cost advantage
will pass.  SSDs are already the low $/IOPS winners.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 11:56 AM, Tristan Ball wrote:

On 6/01/2010 3:00 AM, Roch wrote:

That said, I truly am for a evolution for random read
workloads. Raid-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored with good random read
performance while large objects are stored efficiently.

-r


Sold! Let's do that then! :-)

Seriously - are there design or architectural reasons why this isn't  
done by default, or at least an option? Or is it just a no one's  
had time to implement yet thing?


Waiting on hardware to become widely available might be a long wait.
See also PSARC 2008/769
http://arc.opensolaris.org/caselog/PSARC/2008/769/inception.materials/design_doc

I understand that 4K sectors might be less space efficient for lots  
of small files, but I suspect lots of us would happilly make that  
trade off!


+1 (for better reliability, too!)
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Bob Friesenhahn

On Tue, 5 Jan 2010, Richard Elling wrote:


Since there are already 1 TB SSDs on the market, the only thing keeping the
HDD market alive is the low $/TB.  Moore's Law predicts that cost advantage
will pass.  SSDs are already the low $/IOPS winners.


SSD vendors are still working to stabilize their designs.  Most of 
them seem to be unworthy for use in more than a laptop computer.  A 
number of computer vendors (e.g. Apple & Dell) who offered SSDs in 
their computers encountered an unexpectedly high rate of product 
failure.


According to Sun's own engineers, Moore's Law is very bad for 
enterprise SSDs.  FLASH devices built to very small geometries are 
more likely to wear out and forget.  Current design trends are moving 
in a direction which is contrary to the requirements of enterprise 
SSDs. See


  http://www.eetimes.com/showArticle.jhtml?articleID=219200284

Perhaps innovative designers like Suncast will figure out how to build 
reliable SSDs based on parts which are more likely to wear out and 
forget.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Tristan Ball



On 6/01/2010 7:19 AM, Richard Elling wrote:


If you are doing small, random reads on dozens of TB of data, then you've
got a much bigger problem on your hands... kinda like counting grains of
sand on the beach during low tide :-).  Hopefully, you do not have to 
randomly
update that data because your file system isn't COW :-). Fortunately, 
most

workloads are not of that size and scope.

Since there are already 1 TB SSDs on the market, the only thing 
keeping the
HDD market alive is the low $/TB.  Moore's Law predicts that cost 
advantage

will pass.  SSDs are already the low $/IOPS winners.
 -- richard

These workloads (small random reads over huge datasets) might be getting 
more common in some environments - because it seems to be what you get 
when you consolidate virtual machine storage.


We've got a moderately large number of Virtual Machines (a mix of 
Debian, Win2K & Win2K3) running a very large set of applications, and our 
reads are all over the place! :-( I have to say I remain impressed at 
how well the ARC behaves, but even then our hit rate is often not 
wonderful.


I _dream_ about being able to afford to build out my entire storage from 
cheap/large SSD's. My guess is that in about 2 years I'll be able to, 
which is one of the reasons we've essentially put a hold on buying 
enterprise storage or fast FC/SCSI disks. A large part of the 
justification for FC/SCSI disks is their performance, and they're going 
to be completely eclipsed within the lifetime of any serious mid-range 
to high-end storage array. Until that day we'll make do with large sata 
drives, mirrored, with relatively high spindle counts to avoid long 
per-disk queues.


:-)

T

PS:  OK, I know other tier-1 storage vendors have started integrating 
SSD's as well, but they hadn't when we started our current round of 
storage upgrades, and I still think opensolaris+sata hdds+ssd's gives 
us a cleaner, cheaper and easier upgrade path than most tier-1 vendors 
can provide.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] (Practical) limit on the number of snapshots?

2010-01-05 Thread Juergen Nickelsen
Is there any limit on the number of snapshots in a file system?

The documentation -- manual page, admin guide, troubleshooting guide
-- does not mention any. That seems to confirm my assumption that it
is probably not a fixed limit, but there may still be a practical
one, just like there is no limit on the number of file systems in a
pool, but nobody would find having a million file systems practical.

I have tried to create a number of snapshots in a file system for a
few hours. An otherwise unloaded X4250 with a nearly empty RAID-Z2
pool of six builtin disks (146 GB, 10K rpm) managed to create a few
snapshots per second in an empty file system.
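
For reference, a loop of roughly this shape is all such a test needs
(the pool/file system name and the count are placeholders):

# Create snapshots in a tight loop to probe for a practical limit.
i=0
while [ $i -lt 40000 ]; do
    zfs snapshot tank/testfs@stress-$i
    i=$((i + 1))
done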

It had not visibly slowed down when it reached 36051 snapshots after
hours and I stopped it; to my surprise destroying the file system
(with all these snapshots in it) took about as long. With ``iostat
-xn 1'' I could see that the disk usage was still low, at about 13%
IIRC.

So 36000 snapshots in an empty file system is not a problem. Is it
different with a file system that is, say, 70% full? Or on a
bigger pool? Or with a significantly larger number of snapshots,
say, a million? I am asking for real experience here, not for the
theory.

Regards, Juergen.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 20:19, Richard Elling wrote:

On Jan 5, 2010, at 11:30 AM, Robert Milkowski wrote:

On 05/01/2010 18:49, Richard Elling wrote:

On Jan 5, 2010, at 8:49 AM, Robert Milkowski wrote:


The problem is that while RAID-Z is really good for some workloads 
it is really bad for others.
Sometimes having L2ARC might effectively mitigate the problem but 
for some workloads it won't (due to the huge size of a working 
set). In such environments RAID-Z2 offers much worse performance 
then similarly configured NetApp (RAID-DP, same number of disk 
drives). If ZFS would provide another RAID-5/RAID-6 like protection 
but with different characteristics so writing to a pool would be 
slower but reading from it would be much faster (comparable to 
RAID-DP) some customers would be very happy. Then maybe a new kind 
of cache device would be needed to buffer writes to NV storage to 
make writes faster (like HW arrays have been doing for years).


This still does not address the record checksum.  This is only a 
problem

for small, random read workloads, which means L2ARC is a good solution.
If L2ARC is a set of HDDs, then you could gain some advantage, but IMHO
HDD and good performance do not belong in the same sentence anymore.
Game over -- SSDs win.



as I wrote - sometimes the working set is so big that L2ARC or not 
there is virtually no difference and it is not practical to deploy 
L2ARC several TBs in size or bigger. For such workload RAID-DP 
behaves much better (many small random reads, not that much writes).


If you are doing small, random reads on dozens of TB of data, then you've
got a much bigger problem on your hands... kinda like counting grains of
sand on the beach during low tide :-).  Hopefully, you do not have to 
randomly
update that data because your file system isn't COW :-). Fortunately, 
most

workloads are not of that size and scope.



Well, nevertheless some environments are like that (and no, I'm not 
speculating), and the truth is that a NetApp with RAID-DP and the same 
number of disk drives proved to be faster than RAID-Z2, even with the help 
of SSDs as L2ARC. The point is that NetApp provided the capacity of 
RAID-6 and the protection of dual parity while delivering better 
performance than RAID-Z2 in that environment.
In other workloads RAID-Z2 will be better, but not in this particular 
environment.


All I'm saying is that having yet another RAID type in ZFS which offers 
capacity similar to RAID-5/RAID-6 but with different performance 
characteristics, so that small random reads are on par with RAID-DP while 
sacrificing write performance, would be beneficial for some environments.


RAID-Z with a bigger sector size could improve performance, but the provided 
capacity could be much less than with RAID-5/6, so it might not necessarily 
be an apples-to-apples comparison (though still useful for some environments).


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 20:19, Richard Elling wrote:

[...] Fortunately, most
workloads are not of that size and scope.



Forgot to mention it in my last email - yes, I agree. The environment 
I'm talking about is rather unusual and in most other cases where 
RAID-5/6 was considered the performance of RAID-Z1/2 was good enough or 
even better.



--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-05 Thread Miles Nordin
 dm == David Magda dma...@ee.ryerson.ca writes:

dm 4096 - to-512 blocks

aiui NAND flash has a minimum write size (determined by ECC OOB bits)
of 2 - 4kB, and a minimum erase size that's much larger.  Remapping
cannot abstract away the performance implication of the minimum write
size if you are doing a series of synchronous writes smaller than the
minimum size on a device with no battery/capacitor, although using a
DRAM+supercap prebuffer might be able to abstract away some of it.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Michael Herf
Many large-scale photo hosts start with netapp as the default good
enough way to handle multiple-TB storage. With a 1-5% cache on top,
the workload is truly random-read over many TBs. But these workloads
almost assume a frontend cache to take care of hot traffic, so L2ARC
is just a nice implementation of that, not a silver bullet.

I agree that RAID-DP is much more scalable for reads than RAIDZx, and
this basically turns into a cost concern at scale.

The raw cost/GB for ZFS is much lower, so even a 3-way mirror could be
used instead of netapp. But this certainly reduces the cost advantage
significantly.

mike

p.s. I managed the team that built blogger.com's photo hosting, and
picasaweb.google.com, so I've seen some of this stuff at scale
(neither of these use netapp). For large photos, it's pretty simple:
the more independent spindles, the better.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-05 Thread Wes Felter

Eric D. Mudama wrote:

On Mon, Jan  4 at 16:43, Wes Felter wrote:

Eric D. Mudama wrote:


I am not convinced that a general purpose CPU, running other software
in parallel, will be able to be timely and responsive enough to
maximize bandwidth in an SSD controller without specialized hardware
support.


Fusion-io would seem to be a counter-example, since it uses a fairly 
simple controller (I guess the controller still performs ECC and maybe 
XOR) and the driver eats a whole x86 core. The result is very high 
performance.


I see what you're saying, but it isn't obvious (to me) how well
they're using all the hardware at hand.  2GB/s of bandwidth over their
PCI-e link and what looks like a TON of NAND, with a nearly-dedicated
x86 core...  resuting in 600MB/s or something like that?


Actually it's 600-700MB/s out of a 1+1GB/s slot or 1.5GB/s with two 
cards in a 2+2GB/s slot. I suspect that's pretty close to the PCIe 
limit. IIRC they have 22 NAND channels at 40MB/s (theoretical peak) 
each, which is 880MB/s. I agree that their CPU efficiency is not great, 
but cores are supposed to be cheap these days.


Wes Felter

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 23:31, Michael Herf wrote:

The raw cost/GB for ZFS is much lower, so even a 3-way mirror could be
used instead of netapp. But this certainly reduces the cost advantage
significantly.
   


This is true to some extent. I didn't want to bring it up as I wanted to 
focus only on technical aspect.


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread David Magda

On Jan 5, 2010, at 16:06, Bob Friesenhahn wrote:

Perhaps inovative designers like Suncast will figure out how to  
build reliable SSDs based on parts which are more likely to wear out  
and forget.


At which point we'll probably start seeing the memristor start making  
an appearance in various devices. :)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Practical) limit on the number of snapshots?

2010-01-05 Thread Ian Collins

Juergen Nickelsen wrote:

Is there any limit on the number of snapshots in a file system?

The documentation -- manual page, admin guide, troubleshooting guide
-- does not mention any. That seems to confirm my assumption that is
is probably not a fixed limit, but there may still be a practical
one, just like there is no limit on the number of file systems in a
pool, but nobody would find having a million file systems practical.

I have tried to create a number of snapshots in a file system for a
few hours. An otherwise unloaded X4250 with a nearly empty RAID-Z2
pool of six builtin disks (146 GB, 10K rpm) managed to create a few
snapshots per second in an empty file system.

It had not visibly slowed down when it reached 36051 snapshots after
hours and I stopped it; to my surprise destroying the file system
(with all these snapshots in it) took about as long. With ``iostat
-xn 1'' I could see that the disk usage was still low, at about 13%
IIRC.

So 36000 snapshots in an empty file system is not a problem. Is it
different with a file system that is, say, to 70% full? Or on a
bigger pool? Or with a significantly larger number of snapshots,
say, a million? I am asking for real experience here, not for the
theory.

  
The most I ever had was about 240,000 on a 2TB pool (~1000 filesystems x 
60 days x 4 per day).  There wasn't any noticeable performance impact, 
except when I built a tree of snapshots (via libzfs) to work out which 
ones had to be replicated.


Deleting 50 days worth of them took a very long time!

--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2010-01-05 Thread R.G. Keen
One reason I was so interested in this issue was the double-price of raid 
enabled disks. 

However, I realized that I am doing the initial proving, not production - even 
if personal - of the system I'm building. So for that purpose, an array of 
smaller and cheaper disks might be good. 

In the process of looking at that, I found that geeks.com has Seagate 750GB 
Barracuda ES2 drives for $58 each if you'll put up with them being factory 
recertified and only warrantied for six months. 

Not great, I don't trust refurbished or recertified anything with archival 
data; but it's a test system. So I grabbed six of them for the initial build. 

This gives me a way to compare them against desktop systems in an array. May 
take a while but I can dig some of the issues out.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Tristan Ball
For those searching list archives, the SNV125-S2/40GB given below is not
based on the Intel controller.

I queried Kingston directly about this because there appears to be so
much confusion (and I'm considering using these drives!), and I got back
that:

The V series uses a JMicron Controller
The V+ series uses a Samsung Controller
The M and E series are the Intel Drives

There will be G2 versions of the V and V+ series out shortly, at least
one of which will be based on a Toshiba controller.

Part number prefixes are:

V series: SNV125-S2
V+ Series: SNV225-S2
M Series: SNM225-S2
E Series: SNE125-S2

http://www.kingston.com/anz/ssd/default.asp


It does seem that there are _lots_ of 3rd party websites that claim the
variations of the SNV* parts are Intel based drives, however that's not
what Kingston's rep told me, and it's not what's on their website.

Regards,
Tristan

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Eric D. Mudama
Sent: Tuesday, 5 January 2010 7:35 PM
To: Thomas Burgess
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] need a few suggestions for a poor man's
ZIL/SLOG device

On Mon, Jan  4 at 22:01, Thomas Burgess wrote:
   I guess i got some bad advice then
   I was told the kingston snv125-s2 used almost the exact same
hardware as
   an x25-m and should be considered the poor mans x25-m
...

   Right, i couldn't find any of the 40 gb's in stock so i ordered the
64
   gb.same exact model, only biggerdoes your previous statement
about
   the larger model ssd's not apply to the kingstons?

The SNV125-S2/40GB is the half an X25-M drive which can be often
found as a bare OEM drive for about $85 w/ rebate.

Kingston does sell rebranded Intel SLC drives as well, but under a
different model number: SNE-125S2/32 or SNE-125S2/64.  I don't believe
the 64GB Kingston MLC (SNV-125S2/64) is based on Intel's controller.

The Kingston rebranding of the gen2 intel MLC design is
SNM-125S2B/80 or SNM-125S2B/160.  Those are essentially 34nm Intel
X25-M units I believe.

--eric


-- 
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-05 Thread Erik Trimble
As a further update, I went back and re-read my SSD controller info, and 
then did some more Googling.


Turns out, I'm about a year behind on State-of-the-SSD.  Eric is 
correct on the way current SSDs implement writes (both SLC and MLC), so 
I'm issuing a mea culpa here. The change in implementation appears to 
occur sometime shortly after the introduction of the Indilinx 
controllers.  My fault for not catching this.


-Erik



Eric D. Mudama wrote:

On Sat, Jan  2 at 22:24, Erik Trimble wrote:
In MLC-style SSDs, you typically have a block size of 2k or 4k. 
However, you have a Page size of several multiples of that, 128k 
being common, but by no means ubiquitous.


I believe your terminology is crossed a bit.  What you call a block is
usually called a sector, and what you call a page is known as a block.

Sector is (usually) the unit of reading from the NAND flash.

The unit of write in NAND flash is the page, typically 2k or 4k
depending on NAND generation, and thus consisting of 4-8 ATA sectors
(typically).  A single page may be written at a time.  I believe some
vendors support partial-page programming as well, allowing a single
sector append type operation where the previous write left off.

Ordered pages are collected into the unit of erase, which is known as
a block (or erase block), and is anywhere from 128KB to 512KB or
more, depending again on NAND generation, manufacturer, and a bunch of
other things.

Some large number of blocks are grouped by chip enables, often 4K or
8K blocks.


I think you're confusing erasing with writing.

When I say minimum write size, I mean that for an MLC, no matter 
how small you make a change, the minimum amount of data actually 
being written to the SSD is a full page (128k in my example).


Page is the unit of write, but it's much smaller in all NAND I am
aware of.

There is no append down at this level. If I have a page of 128k, with 
data in 5 of the 4k blocks, and I then want to add another 2k of data 
to this, I have to READ all 5 4k blocks into the controller's DRAM, 
add the 2k of data to that, then write out the full amount to a new 
page (if available), or wait for an older page to be erased before 
writing to it.  Thus, in this case,  in order to do an actual 2k 
write, the SSD must first read 10k of data, do some compositing, then 
write 12k to a fresh page.


Thus, to change any data inside a single page, then entire contents 
of that page have to be read, the page modified, then the entire page 
written back out.


See above.

What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs 
work differently, but still have problems with what I'll call 
excess-writing.


I think you're only describing dumb SSDs with erase-block granularity
mapping. Most (all) vendors have moved away from that technique since
random write performance is awful in those designs and they fall over
dead from wAmp in a jiffy.

SLC and MLC NAND is similar, and they are read/written/erased almost
identically by the controller.

I'm not sure that SSDs actually _have_ to erase - they just overwrite 
anything there with new data. But this is implementation dependent, 
so I can't say how /all/ MLC SSDs behave.


Technically you can program the same NAND page repeatedly, but since
bits can only transition from 1-0 on a program operation, the result
wouldn't be very meaningful.  An erase sets all the bits in the block
to 1, allowing you to store your data.

Once again, what I'm talking about is a characteristic of MLC SSDs, 
which are used in most consumer SSDS (the Intel X25-M, included).


Sure, such an SSD will commit any new writes to pages drawn from the 
list of never before used NAND.  However, at some point, this list 
becomes empty.  In most current MLC SSDs, there's about 10% extra 
(a 60GB advertised capacity is actually ~54GB usable with 6-8GB 
extra).   Once this list is empty, the SSD has to start writing 
back to previous used pages, which may require an erase step first 
before any write. Which is why MLC SSDs slow down drastically once 
they've been fulled to capacity several times.


From what I've seen, erasing a block typically takes a time in the
same scale as programming an MLC page, meaning in flash with large
page counts per block, the % of time spent erasing is not very large.

Lets say that an erase took 100ms and a program took 10ms, in an MLC
NAND device with 100 pages per block.  In this design, it takes us 1s
to program the entire block, but only 1/10 of the time to erase it.
An infinitely fast erase would only make the design about 10% faster.

For SLC the erase performance matters more since page writes are much
faster on average and there are half as many pages, but we were
talking MLC.

The performance differences seen are because the drives were artificially
fast to begin with, because they were empty.  It's similar to
destroking a rotating drive in many ways to speed seek times.  Once
the drive is full, it all comes down to raw NAND performance,

Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-05 Thread Erik Trimble

Wes Felter wrote:

Eric D. Mudama wrote:

On Mon, Jan  4 at 16:43, Wes Felter wrote:

Eric D. Mudama wrote:


I am not convinced that a general purpose CPU, running other software
in parallel, will be able to be timely and responsive enough to
maximize bandwidth in an SSD controller without specialized hardware
support.


Fusion-io would seem to be a counter-example, since it uses a fairly 
simple controller (I guess the controller still performs ECC and 
maybe XOR) and the driver eats a whole x86 core. The result is very 
high performance.


I see what you're saying, but it isn't obvious (to me) how well
they're using all the hardware at hand.  2GB/s of bandwidth over their
PCI-e link and what looks like a TON of NAND, with a nearly-dedicated
x86 core...  resuting in 600MB/s or something like that?


Actually it's 600-700MB/s out of a 1+1GB/s slot or 1.5GB/s with two 
cards in a 2+2GB/s slot. I suspect that's pretty close to the PCIe 
limit. IIRC they have 22 NAND channels at 40MB/s (theoretical peak) 
each, which is 880MB/s. I agree that their CPU efficiency is not 
great, but cores are supposed to be cheap these days.


Wes Felter



The single Fusion-IO card is a 4x PCI-E 1.1 interface, which means about 
1GB/s max throughput. The Fusion-IO Duo is a 8x PCI-E 2.0 interface, 
which tops out at about 4GB/s.   So, it looks like the single card is at 
least a major fraction of the max throughput of the interface, while the 
Duo card still has plenty of headroom. 

I see the single Fusion-IO card eat about 1/4 the CPU power that an 
8Gbit Fibre Channel HBA does, and roughly the same as a 10Gbit 
Ethernet card.  So, it's not out of line with comparable throughput 
add-in cards.  It does need significantly more CPU than a SAS or SCSI 
controller, though.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Eric D. Mudama

On Wed, Jan  6 at 14:56, Tristan Ball wrote:

For those searching list archives, the SNV125-S2/40GB given below is not
based on the Intel controller.

I queried Kingston directly about this because there appears to be so
much confusion (and I'm considering using these drives!), and I got back
that:

The V series uses a JMicron Controller
The V+ series uses a Samsung Controller
The M and E series are the Intel Drives


I'm 99.999% sure that the 40GB V drive is based on the Intel
architecture and that the kingston rep was wrong.

http://www.anandtech.com/storage/showdoc.aspx?i=3667&p=4

The label in the picture clearly shows a bare letter 'V' and 40GB
markings, and the board/layout is identical to the X25-M with only 5
NAND TSOPs instead of 10 or 20.

Either way though, the Intel-branded 40GB MLC drive (X25-V) is now
available on newegg as well:

http://www.newegg.com/Product/Product.aspx?Item=N82E16820167025

Currently $129.99 in retail packaging (with aluminum sled so it can
bolt into a 3.5 drive bay, etc.), no idea if they'll sell 'em for the
$85 that newegg briefly had the kingston branded version.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Thomas Burgess
I think the confusing part is that the 64gb version seems to use a different
controller altogether.
I couldn't find any SNV125-S2/40's in stock so i got 3 SNV125-S2/64's
thinking it would be the same, only bigger... looks like it was stupid on
my part.

now i understand why i got such a good deal.
well i have yet to try them... maybe they won't be so bad... on newegg they
get a lot of good ratings.
either way i doubt using them for the rpool will hurt me... just a little
more expensive than the compact flash cards i was going to get.


On Wed, Jan 6, 2010 at 12:57 AM, Eric D. Mudama
edmud...@bounceswoosh.orgwrote:

 On Wed, Jan  6 at 14:56, Tristan Ball wrote:

 For those searching list archives, the SNV125-S2/40GB given below is not
 based on the Intel controller.

 I queried Kingston directly about this because there appears to be so
 much confusion (and I'm considering using these drives!), and I got back
 that:

 The V series uses a JMicron Controller
 The V+ series uses a Samsung Controller
 The M and E series are the Intel Drives


 I'm 99.999% sure that the 40GB V drive is based on the Intel
 architecture and that the kingston rep was wrong.

 http://www.anandtech.com/storage/showdoc.aspx?i=3667p=4

 The label in the picture clearly shows a bare letter 'V' and 40GB
 markings, and the board/layout is identical to the X25-M with only 5
 NAND TSOPs instead of 10 or 20.

 Either way though, the Intel-branded 40GB MLC drive (X25-V) is now
 available on newegg as well:

 http://www.newegg.com/Product/Product.aspx?Item=N82E16820167025

 Currently $129.99 in retail packaging (with aluminum sled so it can
 bolt into a 3.5 drive bay, etc.), no idea if they'll sell 'em for the
 $85 that newegg briefly had the kingston branded version.


 --eric


 --
 Eric D. Mudama
 edmud...@mail.bounceswoosh.org

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS filesystem size mismatch

2010-01-05 Thread Nils K . Schøyen
A ZFS file system reports 1007GB being used (df -h / zfs list). When doing a 
'du -sh' on the filesystem root, I only get approx. 300GB, which is the correct 
size.

The file system became full during Christmas and I increased the quota from 1 
to 1.5 to 2TB and then decreased to 1.5TB. No reservations. The files and processes 
that filled up the file system have been removed/stopped.

Server: Sun Fire V240 with Solaris 10 10/08 Sparc. iSCSI-connected storage.

Size on other zfs file systems on this server are correctly reported.

See output below.

Any ideas??

iscsi-roskva# ls -la /ndc
total 57
drwxr-xr-x  11 root root  11 Jan  5 13:33 .
drwxr-xr-x  30 root root  38 Jan  6 07:17 ..
drwxr-xr-x   2 root other 12 Nov 29 22:59 TT_DB
drwxr-xr-x  66 1050 11000112 Jan  5 21:20 cssdata
drwxr-xr-x   4 206  26 4 Mar 24  2009 dbsave
drwxr-xr-x  33 1101711000 60 Oct  8 13:18 infomap
drwxr-xr-x   8 root other  8 Mar  6  2008 operations
drwxr-xr-x  48 1100011000 90 Jan  5 16:11 programs
drwxr-xr-x  12 root other 13 Aug 10 10:48 projects
drwxr-xr-x  21 1100111000 51 Sep 15 12:00 request
drwxr-xr-x  32 1100511000 45 Jan  5 11:02 stations

iscsi-roskva# du -sh /ndc/*
  24K   TT_DB
 6.5G   cssdata
 4.4G   dbsave
 535M   infomap
  71G   operations
  46G   programs
  79G   projects
 6.7G   request
  70G   stations

iscsi-roskva# df -h /ndc
Filesystem size   used  avail capacity  Mounted on
storagepool/ndc1.5T  1007G   529G66%/ndc

iscsi-roskva# zfs get all storagepool/ndc
NAME PROPERTY VALUE  SOURCE
storagepool/ndc  type filesystem -
storagepool/ndc  creation Thu Jul 30 15:14 2009  -
storagepool/ndc  used 1007G  -
storagepool/ndc  available529G   -
storagepool/ndc  referenced   1007G  -
storagepool/ndc  compressratio1.00x  -
storagepool/ndc  mounted  yes-
storagepool/ndc  quota1.50T  local
storagepool/ndc  reservation  none   default
storagepool/ndc  recordsize   128K   default
storagepool/ndc  mountpoint   /ndc   local
storagepool/ndc  sharenfs rw,root=hugin  local
storagepool/ndc  checksum on default
storagepool/ndc  compression  offdefault
storagepool/ndc  atimeoffinherited from 
storagepool
storagepool/ndc  devices  on default
storagepool/ndc  exec on default
storagepool/ndc  setuid   on default
storagepool/ndc  readonly offdefault
storagepool/ndc  zonedoffdefault
storagepool/ndc  snapdir  hidden default
storagepool/ndc  aclmode  groupmask  default
storagepool/ndc  aclinherit   restricted default
storagepool/ndc  canmount on default
storagepool/ndc  shareiscsi   offdefault
storagepool/ndc  xattron default
storagepool/ndc  copies   1  default
storagepool/ndc  version  3  -
storagepool/ndc  utf8only off-
storagepool/ndc  normalizationnone   -
storagepool/ndc  casesensitivity  sensitive  -
storagepool/ndc  vscanoffdefault
storagepool/ndc  nbmand   offdefault
storagepool/ndc  sharesmb offdefault
storagepool/ndc  refquota none   default
storagepool/ndc  refreservation   none   default
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss