Re: [zfs-discuss] Restricting smb share to specific interfaces

2010-01-05 Thread Jerome Warnier
Tim Cook wrote:


 On Sun, Jan 3, 2010 at 6:58 PM, Jerome Warnier jwarn...@beeznest.net wrote:

 Hi,

 I'm sharing ZFS filesystems over SMB.
 I know how to restrict access to them for certain hosts (and users), but I
 could not find any way to prevent the SMB protocol from being advertised on a
 specific interface (or, the other way around, to specify the interfaces I
 do want it on).
 Is there any way, other than setting up a firewall, to filter the
 interface?



 I believe it can be done with Crossbow and Flows by defining cifs as a
 service.
 http://hub.opensolaris.org/bin/view/Project+crossbow/faq#flow_what
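
For reference, a rough sketch of what defining CIFS as a Crossbow flow might
look like. The link name e1000g0 is a placeholder and SMB over TCP is assumed
to be port 445; flows are primarily for accounting and bandwidth/priority
control, so whether this can actually stop the service from being advertised
on an interface is exactly the open question here.

# hypothetical link name; the NetBIOS session service (port 139) may need a flow too
flowadm add-flow -l e1000g0 -a transport=tcp,local_port=445 cifs-flow
flowadm show-flow cifs-flow
flowadm set-flowprop -p maxbw=10M cifs-flow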
  
Isn't there any simpler way?
With Samba you can restrict the service to specific interfaces. Isn't there
anything similar for the ZFS/CIFS service?
 -- 
 --Tim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Eric D. Mudama

On Mon, Jan  4 at 22:01, Thomas Burgess wrote:

  I guess i got some bad advice then
  I was told the kingston snv125-s2 used almost the exact same hardware as
  an x25-m and should be considered the poor mans x25-m

...


  Right, i couldn't find any of the 40 gb's in stock so i ordered the 64
  gb... same exact model, only bigger... does your previous statement about
  the larger model ssd's not apply to the kingstons?


The SNV125-S2/40GB is the "half an X25-M" drive, which can often be
found as a bare OEM drive for about $85 with a rebate.

Kingston does sell rebranded Intel SLC drives as well, but under a
different model number: SNE-125S2/32 or SNE-125S2/64.  I don't believe
the 64GB Kingston MLC (SNV-125S2/64) is based on Intel's controller.

The Kingston rebranding of the gen2 intel MLC design is
SNM-125S2B/80 or SNM-125S2B/160.  Those are essentially 34nm Intel
X25-M units I believe.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-05 Thread Eric D. Mudama

On Mon, Jan  4 at 16:43, Wes Felter wrote:

Eric D. Mudama wrote:


I am not convinced that a general purpose CPU, running other software
in parallel, will be able to be timely and responsive enough to
maximize bandwidth in an SSD controller without specialized hardware
support.


Fusion-io would seem to be a counter-example, since it uses a fairly 
simple controller (I guess the controller still performs ECC and 
maybe XOR) and the driver eats a whole x86 core. The result is very 
high performance.


Wes Felter


I see what you're saying, but it isn't obvious (to me) how well
they're using all the hardware at hand.  2GB/s of bandwidth over their
PCI-e link and what looks like a TON of NAND, with a nearly-dedicated
x86 core...  resulting in 600MB/s or something like that?

While the number is very good for NAND flash SSDs, it seems like a TON
of horsepower going to waste, and they still have a large onboard
controller/FPGA.  I guess enough CPU can make the units faster, but
I'm just not sold.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Thomas Burgess
The SNV125-S2/40GB is the half an X25-M drive which can be often

 found as a bare OEM drive for about $85 w/ rebate.

 Kingston does sell rebranded Intel SLC drives as well, but under a
 different model number: SNE-125S2/32 or SNE-125S2/64.  I don't believe
 the 64GB Kingston MLC (SNV-125S2/64) is based on Intel's controller.

 The Kingston rebranding of the gen2 intel MLC design is
 SNM-125S2B/80 or SNM-125S2B/160.  Those are essentially 34nm Intel
 X25-M units I believe.
  edmud...@mail.bounceswoosh.org

Yes, SSD model numbers are purposely confusing and deceitful, I think.

All in all, I have no one but myself to blame... and even with this mishap the
SSD isn't a waste of money.

The 64 GB version is based on the second revision of the dreaded JMicron
controller, but according to my new research the original issues with that
controller were fixed before this SSD was released... so apparently they
DO perform as expected.

Worst case scenario, I can use two to mirror my rpool and one for a cheap L2ARC.

I also noticed Intel sells a cheaper model of the X25;
I want to say it was the X25-V, but I might be wrong...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-05 Thread Andrey Kuzmin
600? I've heard 1.5 GB/s reported.

On 1/5/10, Eric D. Mudama edmud...@bounceswoosh.org wrote:
 On Mon, Jan  4 at 16:43, Wes Felter wrote:
Eric D. Mudama wrote:

I am not convinced that a general purpose CPU, running other software
in parallel, will be able to be timely and responsive enough to
maximize bandwidth in an SSD controller without specialized hardware
support.

Fusion-io would seem to be a counter-example, since it uses a fairly
simple controller (I guess the controller still performs ECC and
maybe XOR) and the driver eats a whole x86 core. The result is very
high performance.

Wes Felter

 I see what you're saying, but it isn't obvious (to me) how well
 they're using all the hardware at hand.  2GB/s of bandwidth over their
 PCI-e link and what looks like a TON of NAND, with a nearly-dedicated
 x86 core...  resuting in 600MB/s or something like that?

 While the number is very good for NAND flash SSDs, it seems like a TON
 of horsepower going to waste, and they still have a large onboard
 controller/FPGA.  I guess enough CPU can make the units faster, but
 i'm just not sold.

 --
 Eric D. Mudama
 edmud...@mail.bounceswoosh.org

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



-- 
Regards,
Andrey
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] send/recv, apparent data loss

2010-01-05 Thread Michael Herf
I replayed a bunch of filesystems in order to get dedupe benefits.
The only thing is, a couple of them are rolled back to November or so (and
I didn't notice before destroying the old copies).

I used something like:

zfs snapshot pool/fs@dd
zfs send -Rp pool/fs@dd | zfs recv -d pool/fs2
(after done...)
zfs destroy pool/fs
zfs rename pool/fs2/fs pool/fs

What are the failure modes for partial send/recv? I've experienced
full rollbacks when the process is canceled.
But in my case it feels like the stream was truncated and the filesystem
ended up only partially built. Is this an expected result?

It does seem like ZFS needs a way to do this kind of operation
atomically in the future, but I'm more interested in understanding if
there's something I did wrong using the current tools, or if there are
bugs.

I was running b130 to do these operations, and it seems like previous
attempts in b128 and b129 completed successfully.

mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Joerg Schilling
Chris Du dilid...@gmail.com wrote:

 You can use the utility to erase all blocks and regain performance, but it's 
 a manual process and quite complex. Windows 7 support TRIM, if SSD firmware 
 also supports it, the process is run in the background so you will not notice 
 performance degrade. I don't think any other OS supports TRIM.

IIRC, Linux also supports the TRIM command.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] send/recv, apparent data loss

2010-01-05 Thread Ian Collins

Michael Herf wrote:

I replayed a bunch of filesystems in order to get dedupe benefits.
Only thing is a couple of them are rolled back to November or so (and
I didn't notice before destroy'ing the old copy).

I used something like:

zfs snapshot pool/f...@dd
zfs send -Rp pool/f...@dd | zfs recv -d pool/fs2
(after done...)
zfs destroy pool/fs
zfs rename pool/fs2/fs pool/fs

What are the failure modes for partial send/recv? I've experienced
full rollbacks when the process is canceled.
But my case feels like the stream became truncated and the filesystem
ended up partially built? Is this an expected result?
  

Individual receives should be atomic.


It does seem like ZFS needs a way to do this kind of operation
atomically in the future, but I'm more interested in understanding if
there's something I did wrong using the current tools, or if there are
bugs.
  
Was there any error output?  I always use -v on recursive receives to 
track progress.
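
For example, a sketch of the same replication with a verbose receive (dataset
names follow the commands quoted above), so errors and per-snapshot progress
are visible as the stream is applied:

zfs snapshot pool/fs@dd
zfs send -Rp pool/fs@dd | zfs recv -d -v pool/fs2
echo "receive exit status: $?"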


--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] send/recv, apparent data loss

2010-01-05 Thread Michael Herf
I didn't use -v, so I don't know.
I just waited until the process exited, assuming it would succeed or
fail. The sizes looked equivalent, so I went ahead with the destroy,
rename.

For the jobs a couple weeks ago, I turned off the snapshot service.
For this one, I probably left it on. Anything possible there?

The only other thing is that I did zfs rollback for a totally
unrelated filesystem in the pool, but I have no idea if this could
have affected it.
(I've verified that I got the right one with zpool history.)

mike


On Tue, Jan 5, 2010 at 2:24 AM, Ian Collins i...@ianshome.com wrote:
 Michael Herf wrote:

 I replayed a bunch of filesystems in order to get dedupe benefits.
 Only thing is a couple of them are rolled back to November or so (and
 I didn't notice before destroy'ing the old copy).

 I used something like:

 zfs snapshot pool/f...@dd
 zfs send -Rp pool/f...@dd | zfs recv -d pool/fs2
 (after done...)
 zfs destroy pool/fs
 zfs rename pool/fs2/fs pool/fs

 What are the failure modes for partial send/recv? I've experienced
 full rollbacks when the process is canceled.
 But my case feels like the stream became truncated and the filesystem
 ended up partially built? Is this an expected result?


 Individual receives should be atomic.

 It does seem like ZFS needs a way to do this kind of operation
 atomically in the future, but I'm more interested in understanding if
 there's something I did wrong using the current tools, or if there are
 bugs.


 Was there any error output?  I always use -v on recursive receives to track
 progress.

 --
 Ian.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Mikko Lammi
Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. The good thing
about ZFS is that it allows this without any issues. Unfortunately, now that
we need to get rid of them (because they eat 80% of the disk space), it is
proving quite challenging.

Traditional approaches like "find ./ -exec rm {} \;" seem to take forever
- after running for several days, the directory size still stays the same. The
only way I've been able to remove anything has been by running "rm -rf"
on the problematic directory from the parent level. Running this command
shows the directory size decreasing by about 10,000 files/hour, but that would
still mean close to ten months (over 250 days) to delete everything!

I also tried running unlink on the directory as root, as the user who
created the directory, after changing the directory's owner to root, and so
forth, but all attempts gave a "Not owner" error.

Any command like "ls -f" or "find" will run for hours (or days) without
actually listing anything from the directory, so I'm beginning to suspect
that the directory's data structure is somehow damaged. Are there any
diagnostics that I can run with e.g. zdb to investigate, and hopefully fix,
a single directory within a ZFS dataset?

To make things even more difficult, this directory is located in the root
filesystem, so dropping the ZFS filesystem would basically mean reinstalling
the entire system, which is something we would really rather not do.

The OS is Solaris 10 and the zpool version is 10 (rather old, I know, but is
there an easy upgrade path that might solve this problem?). The pool consists
of two 146 GB SAS drives in a mirror setup.


Any help would be appreciated.

Thanks,
Mikko

-- 
 Mikko Lammi | l...@lmmz.net | http://www.lmmz.net

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Joerg Schilling
Mikko Lammi mikko.la...@lmmz.net wrote:

 Hello,

 As a result of one badly designed application running loose for some time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

 Traditional approaches like find ./ -exec rm {} \; seem to take forever
 - after running several days, the directory size still says the same. The
 only way how I've been able to remove something has been by giving rm
 -rf to problematic directory from parent level. Running this command
 shows directory size decreasing by 10,000 files/hour, but this would still
 mean close to ten months (over 250 days) to delete everything!

Do you know the number of files at which it really starts to become unusably
slow? I had directories with 3 million files on UFS and this was just a bit
slower than with small directories.

BTW: "find ./ -exec rm {} \;" is definitely the wrong command, as it has long
been known to take forever. This is why "find ./ -exec rm {} +" was
introduced 20 years ago.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Markus Kovero
Hi. While this is not a complete solution, I'd suggest turning atime off, so
that find/rm does not update access times, and possibly destroying unnecessary
snapshots before removing the files; that should be quicker.
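
As a concrete sketch (the dataset name is a placeholder; use whichever
filesystem actually holds the directory):

zfs set atime=off rpool/ROOT/s10_root          # placeholder dataset name
zfs get atime rpool/ROOT/s10_root
zfs list -t snapshot -r rpool/ROOT/s10_root    # snapshots still referencing the files
zfs destroy rpool/ROOT/s10_root@oldsnap        # only if the snapshot is disposable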


Yours
Markus Kovero


-Original Message-
From: zfs-discuss-boun...@opensolaris.org 
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Mikko Lammi
Sent: 5 January 2010 12:35
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] Clearing a directory with more than 60 million files

Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. Good thing
about ZFS is that it allows it without any issues. Unfortunatelly now that
we need to get rid of them (because they eat 80% of disk space) it seems
to be quite challenging.

Traditional approaches like find ./ -exec rm {} \; seem to take forever
- after running several days, the directory size still says the same. The
only way how I've been able to remove something has been by giving rm
-rf to problematic directory from parent level. Running this command
shows directory size decreasing by 10,000 files/hour, but this would still
mean close to ten months (over 250 days) to delete everything!

I also tried to use unlink command to directory as a root, as a user who
created the directory, by changing directory's owner to root and so forth,
but all attempts gave Not owner error.

Any commands like ls -f or find will run for hours (or days) without
actually listing anything from the directory, so I'm beginning to suspect
that maybe the directory's data structure is somewhat damaged. Is there
some diagnostics that I can run with e.g zdb to investigate and
hopefully fix for a single directory within zfs dataset?

To make things even more difficult, this directory is located in rootfs,
so dropping the zfs filesystem would basically mean reinstalling the
entire system, which is something that we really wouldn't wish to go.


OS is Solaris 10, zpool version is 10 (rather old, I know, but is there
easy path for upgrade that might solve this problem?) and the zpool
consists two 146 GB SAS drivers in a mirror setup.


Any help would be appreciated.

Thanks,
Mikko

-- 
 Mikko Lammi | l...@lmmz.net | http://www.lmmz.net

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Mike Gerdts
On Tue, Jan 5, 2010 at 4:34 AM, Mikko Lammi mikko.la...@lmmz.net wrote:
 Hello,

 As a result of one badly designed application running loose for some time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

 Traditional approaches like find ./ -exec rm {} \; seem to take forever
 - after running several days, the directory size still says the same. The
 only way how I've been able to remove something has been by giving rm
 -rf to problematic directory from parent level. Running this command
 shows directory size decreasing by 10,000 files/hour, but this would still
 mean close to ten months (over 250 days) to delete everything!

 I also tried to use unlink command to directory as a root, as a user who
 created the directory, by changing directory's owner to root and so forth,
 but all attempts gave Not owner error.

 Any commands like ls -f or find will run for hours (or days) without
 actually listing anything from the directory, so I'm beginning to suspect
 that maybe the directory's data structure is somewhat damaged. Is there
 some diagnostics that I can run with e.g zdb to investigate and
 hopefully fix for a single directory within zfs dataset?

In situations like this, ls will be exceptionally slow partially
because it will sort the output.  Find is slow because it needs to
call lstat() on every entry.  In similar situations I have found the
following to work.

perl -e 'opendir(D, "."); while ( $d = readdir(D) ) { print "$d\n" }'

Replace print with unlink if you wish...
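
A sketch of that unlink variant, skipping . and .. (run it from inside the
problem directory, and only once you are sure everything in it is disposable):

perl -e 'opendir(D, "."); while ( $d = readdir(D) ) { next if $d eq "." || $d eq ".."; unlink($d) or warn "$d: $!\n" }'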


 To make things even more difficult, this directory is located in rootfs,
 so dropping the zfs filesystem would basically mean reinstalling the
 entire system, which is something that we really wouldn't wish to go.


 OS is Solaris 10, zpool version is 10 (rather old, I know, but is there
 easy path for upgrade that might solve this problem?) and the zpool
 consists two 146 GB SAS drivers in a mirror setup.


 Any help would be appreciated.

 Thanks,
 Mikko

 --
  Mikko Lammi | l...@lmmz.net | http://www.lmmz.net

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Michael Schuster

Mike Gerdts wrote:

On Tue, Jan 5, 2010 at 4:34 AM, Mikko Lammi mikko.la...@lmmz.net wrote:

Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. Good thing
about ZFS is that it allows it without any issues. Unfortunatelly now that
we need to get rid of them (because they eat 80% of disk space) it seems
to be quite challenging.

Traditional approaches like find ./ -exec rm {} \; seem to take forever
- after running several days, the directory size still says the same. The
only way how I've been able to remove something has been by giving rm
-rf to problematic directory from parent level. Running this command
shows directory size decreasing by 10,000 files/hour, but this would still
mean close to ten months (over 250 days) to delete everything!

I also tried to use unlink command to directory as a root, as a user who
created the directory, by changing directory's owner to root and so forth,
but all attempts gave Not owner error.

Any commands like ls -f or find will run for hours (or days) without
actually listing anything from the directory, so I'm beginning to suspect
that maybe the directory's data structure is somewhat damaged. Is there
some diagnostics that I can run with e.g zdb to investigate and
hopefully fix for a single directory within zfs dataset?


In situations like this, ls will be exceptionally slow partially
because it will sort the output. 


that's what '-f' was supposed to avoid, I'd guess.

Michael
--
Michael Schuster        http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-05 Thread Joerg Schilling
Juergen Nickelsen n...@jnickelsen.de wrote:

 joerg.schill...@fokus.fraunhofer.de (Joerg Schilling) writes:

  The netapps patents contain claims on ideas that I invented for my Diploma 
  thesis work between 1989 and 1991, so the netapps patents only describe 
  prior
  art. The new ideas introduced with wofs include the ideas on how to use 
  COW
  for filesystems and on how to find the most recent superblock on a COW 
  filesystem. The ideas for the latter method were developed while 
  discussing the wofs structure with Carsten Bormann at TU-Berlin.

 Would you perhaps be willing to share the text? Sounds quite
 interesting, especially to compare it with ZFS and with Netapp's
 introduction to WAFL that I read a while ago.

If you are interested in the text, it is here:

http://cdrecord.berlios.de/private/wofs.ps.gz (the old original, without
images, as they were not created with troff)
http://cdrecord.berlios.de/private/WoFS.pdf (a reformatted version with
images included)

If you would like to see the program code, I am considering making it
available at some point.


 (And I know that discussions with Carsten Bormann can
 result in remarkable results -- not that I would want to disregard
 your own part in these ideas. :-)

Yes, he is a really helpful discussion partner.

As a note: the basic ideas for implementing COW (such as inverting the tree
structure to avoid rewriting all directories up to the root when a nested
file is updated, the idea of using generation nodes called G-nodes, and the
idea of how updated superblocks can be found) were invented by me. Carsten
helped to develop a method for defining and locating extension areas for
updated superblocks for the case when the primary superblock update area has
become full. The latter idea is not needed on a hard-disk-based filesystem,
as hard disks allow old superblock locations to be overwritten. On WORM media
it is essential, to make sure that the medium remains usable for writing as
long as there are unwritten blocks.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread David Magda
On Tue, January 5, 2010 05:34, Mikko Lammi wrote:

 As a result of one badly designed application running loose for some time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

How about creating a new data set, moving the directory into it, and then
destroying it?

Assuming the directory in question is /opt/MYapp/data:
  1. zfs create rpool/junk
  2. mv /opt/MYapp/data /rpool/junk/
  3. zfs destroy rpool/junk


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Casper . Dik

On Tue, January 5, 2010 05:34, Mikko Lammi wrote:

 As a result of one badly designed application running loose for some time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

How about creating a new data set, moving the directory into it, and then
destroying it?

Assuming the directory in question is /opt/MYapp/data:
  1. zfs create rpool/junk
  2. mv /opt/MYapp/data /rpool/junk/
  3. zfs destroy rpool/junk

The move will create and remove the files; the removal done by mv will be
just as inefficient, removing them one by one.

rm -rf would be at least as quick.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Mikko Lammi
On Tue, January 5, 2010 17:08, David Magda wrote:
 On Tue, January 5, 2010 05:34, Mikko Lammi wrote:

 As a result of one badly designed application running loose for some
 time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now
 that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

 How about creating a new data set, moving the directory into it, and then
 destroying it?

 Assuming the directory in question is /opt/MYapp/data:
   1. zfs create rpool/junk
   2. mv /opt/MYapp/data /rpool/junk/
   3. zfs destroy rpool/junk

Tried that as well. It moves individual files to the new directory at
roughly 3,000/minute, so it's no faster than anything I can apply directly
to the original directory.

I also tried the Perl script that does readdir() mentioned earlier (it's as
slow as anything else), and switched the dataset's atime property to off,
but that didn't have much effect either.

However, when we deleted some other files from the volume and managed to
raise free disk space from 4 GB to 10 GB, the "rm -rf directory" method
started to perform significantly faster. It's now deleting around 4,000
files/minute (240,000/hour - quite an improvement over 10,000/hour). I
remember seeing some discussion about ZFS performance when a filesystem
becomes very full, so I wonder if that was the case here.

Next I'm going to see whether "find ./ -exec rm {} +" yields any better
results than "rm -rf" from the parent directory. But I guess at some point
the bottleneck will just be CPU (this is a 1 GHz T1000 system) and disk I/O,
not the ZFS filesystem. I'm just wondering what kind of figures to expect.


regards,
Mikko

-- 
 Mikko Lammi | l...@lmmz.net | http://www.lmmz.net

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 10 and ZFS dedupe status

2010-01-05 Thread Bob Friesenhahn

On Mon, 4 Jan 2010, Tony Russell wrote:

I am under the impression that dedupe is still only in OpenSolaris 
and that support for dedupe is limited or non existent.  Is this 
true?  I would like to use ZFS and the dedupe capability to store 
multiple virtual machine images.  The problem is that this will be 
in a production environment and would probably call for Solaris 10 
instead of OpenSolaris.  Are my statements on this valid or am I off 
track?


If dedup gets scheduled for Solaris 10 (I don't know), it would surely 
not be available until at least a year from now.


Dedup in OpenSolaris still seems risky to use other than for 
experimental purposes.  It has only recently become available.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread David Magda
On Tue, January 5, 2010 10:12, casper@sun.com wrote:

How about creating a new data set, moving the directory into it, and then
destroying it?

Assuming the directory in question is /opt/MYapp/data:
  1. zfs create rpool/junk
  2. mv /opt/MYapp/data /rpool/junk/
  3. zfs destroy rpool/junk

 The move will create and remove the files; the remove by mv will be as
 inefficient removing them one by one.

 rm -rf would be at least as quick.

Normally, when you do a move within a 'regular' file system, all that usually
happens is that a directory pointer is shuffled around. Is that not the case
with ZFS datasets, even though they're in the same pool?


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Casper . Dik

On Tue, January 5, 2010 10:12, casper@sun.com wrote:

How about creating a new data set, moving the directory into it, and then
destroying it?

Assuming the directory in question is /opt/MYapp/data:
  1. zfs create rpool/junk
  2. mv /opt/MYapp/data /rpool/junk/
  3. zfs destroy rpool/junk

 The move will create and remove the files; the remove by mv will be as
 inefficient removing them one by one.

 rm -rf would be at least as quick.

Normally when you do a move with-in a 'regular' file system all that's
usually done is the directory pointer is shuffled around. This is not the
case with ZFS data sets, even though they're on the same pool?


You can only rename files within a single ZFS filesystem; within one zpool
but across different filesystems, you will need to copy and remove.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Dennis Clarke

 On Tue, January 5, 2010 10:12, casper@sun.com wrote:

How about creating a new data set, moving the directory into it, and
 then
destroying it?

Assuming the directory in question is /opt/MYapp/data:
  1. zfs create rpool/junk
  2. mv /opt/MYapp/data /rpool/junk/
  3. zfs destroy rpool/junk

 The move will create and remove the files; the remove by mv will be
 as
 inefficient removing them one by one.

 rm -rf would be at least as quick.

 Normally when you do a move with-in a 'regular' file system all that's
 usually done is the directory pointer is shuffled around. This is not the
 case with ZFS data sets, even though they're on the same pool?


You can also use star which may speed things up, safely.

star -copy -p -acl -sparse -dump -xdir -xdot -fs=96m -fifostats -time \
-C source_dir . destination_dir


That will buffer the transport of the data from source to destination via
memory and work to keep that buffer full as data is written on the output
side. It's probably at least as fast as mv, and probably safer, because you
never delete the original until after the copy is complete.


-- 
Dennis Clarke
dcla...@opensolaris.ca  - Email related to the open source Solaris
dcla...@blastwave.org   - Email related to open source for Solaris


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool destroy -f hangs system, now zpool import hangs system.

2010-01-05 Thread Carl Rathman
On Mon, Jan 4, 2010 at 8:59 PM, Richard Elling richard.ell...@gmail.com wrote:

 On Jan 4, 2010, at 6:40 AM, Carl Rathman wrote:

 I have a zpool raidz1 array (called storage) that I created under snv_118.

 I then created a zfs filesystem called storage/vmware which I shared
 out via iscsi.



 I then deleted the vmware filesystem, using 'zpool destroy -f
 storage/vmware' -- which resulted in heavy disk activity, and then
 hard locked the system after 10 minutes.

 I rebooted the machine, but was unable to boot. The machine would hang
 on Reading ZFS Configuration: - (the stick wouldn't even spin.)

 I was able to work around that by booting to a live CD, and deleting
 the zfs cache on my rpool.

 [clicked the wrong button]
 If you destroy the pool, then why try to import?
  -- richard

 zpool import sees my raidz1 array, but if I try 'zpool import -f
 storage', I get the same behavior of heavy disk activity for
 approximately 10 minutes, then a hard lock.


 Any clues on this one?

 Thanks,
 Carl
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



I didn't mean to destroy the pool.  I used zpool destroy on a zvol,
when I should have used zfs destroy.

When I used zpool destroy -f mypool/myvolume the machine hard locked
after about 20 minutes.

I don't want to destroy the pool, I just wanted to destroy the one
volume. -- Which is why I now want to import the pool itself. Does
that make sense?

Thanks,
Carl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread David Magda
On Tue, January 5, 2010 10:50, Michael Schuster wrote:
 David Magda wrote:
 Normally when you do a move with-in a 'regular' file system all that's
 usually done is the directory pointer is shuffled around. This is not
 the case with ZFS data sets, even though they're on the same pool?

 no - mv doesn't know about zpools, only about posix filesystems.

So the delineation of POSIX file systems is done at the data set layer,
and not at the zpool layer. (Which makes sense since the output of 'df'
tends to closely mimic the output of 'zfs list'.)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Roch

Richard Elling writes:
  On Jan 3, 2010, at 11:27 PM, matthew patton wrote:
  
   I find it baffling that RaidZ(2,3) was designed to split a record- 
   size block into N (N=# of member devices) pieces and send the  
   uselessly tiny requests to spinning rust when we know the massive  
   delays entailed in head seeks and rotational delay. The ZFS-mirror  
   and load-balanced configuration do the obviously correct thing and  
   don't split records and gain more by utilizing parallel access. I  
   can't imagine the code-path for RAIDZ would be so hard to fix.
  
  Knock yourself out :-)
  http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c
  
   I've read posts back to 06 and all I see are lamenting about the  
   horrendous drop in IOPs, about sizing RAIDZ to ~4+P and trying to  
   claw back performance by combining multiple such vDEVs. I understand  
   RAIDZ will never equal Mirroring, but it could get damn close if it  
   didn't break requests down and better yet utilized copies=N and  
   properly placed the copies on disparate spindles. This is somewhat  
   analogous to what the likes of 3PAR do and it's not rocket science.
  
  That is not the issue for small, random reads.  For all reads, the  
  checksum is
  verified. When you spread the record across multiple disks, then you  
  need
  to read the record back from those disks. In general, this means that as
  long as the recordsize is larger than the requested small read, then  
  your
  performance will approach the N/(N-P) * IOPS limit. At the  
  pathological edge,
  you can set recordsize to 512 bytes and you end up with mirroring (!)
  The small, random read performance model I developed only calculates
  the above IOPS limit, and does not consider recordsize.
  
  The physical I/O is much more difficult to correlate to the logical I/ 
  O because
  of all of the coalescing and caching that occurs at all of the lower  
  levels in
  the stack.
  
   An 8 disk mirror and a RAIDZ8+2P w/ copies=2 give me the same amount  
   of storage but the latter is a hell of a lot more resilient and max  
   IOPS should be higher to boot. An non-broken-up RAIDZ4+P would still  
   be 1/2 the IOPS of the 8 disk mirror but I'd at least save a bundle  
   of coin in either reduced spindle count or using slower drives.
  
   With all the great things ZFS is capable of, why hasn't this been  
   redesigned long ago? what glaringly obvious truth am I missing?
  
  Performance, dependability, space: pick two.
-- richard
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


If you store record X in one column, like RAID-5 or -6 does, then
you need to generate parity for that record X by grouping it
with other, unrelated records Y, Z, T, etc. When X is freed in the
filesystem, it still holds parity information protecting Y,
Z and T, so you can't get rid of what was stored at X. If you try
to store new data in X and in the associated parity but fail in
mid-stream, you hit the RAID-5 write hole. Moreover, now
that X is no longer referenced in the filesystem, no checksum is
associated with it any more, and if bit rot occurs in X and the disk
holding Y dies, resilvering would generate garbage for Y.

This seems to force us to chunk up disks with every unit
checksummed even when freed. Secure deletion becomes a problem
as well. And you can end up madly searching for free
stripes and repositioning old blocks in partial stripes even if
the pool is only 10% full.

Can one do this with RAID-DP?
http://blogs.sun.com/roch/entry/need_inodes


That said, I am truly for an evolution for random-read
workloads. RAID-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored, with good random-read
performance, while large objects are stored efficiently.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 2:34 AM, Mikko Lammi wrote:


Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. Good thing
about ZFS is that it allows it without any issues. Unfortunatelly now that
we need to get rid of them (because they eat 80% of disk space) it seems
to be quite challenging.

Traditional approaches like find ./ -exec rm {} \; seem to take forever
- after running several days, the directory size still says the same. The
only way how I've been able to remove something has been by giving rm
-rf to problematic directory from parent level. Running this command
shows directory size decreasing by 10,000 files/hour, but this would still
mean close to ten months (over 250 days) to delete everything!


This is, in part, due to stat() slowness.  Fixed in later OpenSolaris  
builds.

I have no idea if or when the fix will be backported to Solaris 10.

I also tried to use unlink command to directory as a root, as a user who
created the directory, by changing directory's owner to root and so forth,
but all attempts gave Not owner error.

Any commands like ls -f or find will run for hours (or days) without
actually listing anything from the directory, so I'm beginning to suspect
that maybe the directory's data structure is somewhat damaged. Is there
some diagnostics that I can run with e.g zdb to investigate and
hopefully fix for a single directory within zfs dataset?

To make things even more difficult, this directory is located in rootfs,
so dropping the zfs filesystem would basically mean reinstalling the
entire system, which is something that we really wouldn't wish to go.


How are the files named?  If you know something about the filename
pattern, then you could create subdirs and mv large numbers of files
to reduce the overall size of a single directory.  Something like:

mkdir .A
mv A* .A
mkdir .B
mv B* .B
...
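
If the names really do start with a small set of predictable prefixes, a rough
sketch of automating that split is below; note that the shell glob has to
expand every matching name in memory, which may be slow or hit ARG_MAX at
this scale, so treat it as an experiment rather than a sure win:

# prefixes are a guess; run from inside the problem directory
for p in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z; do
    mkdir ".$p"
    mv "$p"* ".$p"/ 2>/dev/null
done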

Also, as previously noted, atime=off.

If you can handle a reboot, you can bump the size of the DNLC, which
might help also.  OTOH, if you can reboot you can also run the latest
b130 livecd which has faster stat().
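
For reference, the DNLC size is controlled by the ncsize tunable in
/etc/system; a sketch (the value is purely illustrative, and the change only
takes effect after the reboot):

* /etc/system -- illustrative value only
set ncsize=1000000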
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Joerg Schilling
Michael Schuster michael.schus...@sun.com wrote:

  rm -rf would be at least as quick.
  
  Normally when you do a move with-in a 'regular' file system all that's
  usually done is the directory pointer is shuffled around. This is not the
  case with ZFS data sets, even though they're on the same pool?

 no - mv doesn't know about zpools, only about posix filesystems.

mv first tries to rename(2) the file. If this does not succeed but results in 
EXDEV, it copies the file.
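
In other words (paths made up for illustration):

# same dataset: rename(2) succeeds, nothing is copied
mv /tank/fs1/bigfile /tank/fs1/archive/bigfile
# different dataset in the same pool: rename(2) fails with EXDEV,
# so mv copies the data and then unlinks the source
mv /tank/fs1/bigfile /tank/fs2/bigfile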

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can't export pool after zfs receive

2010-01-05 Thread David Dyer-Bennet

On Mon, January 4, 2010 13:51, Ross wrote:
 I initialized a new whole-disk pool on an external
 USB drive, and then did zfs send from my big data pool and zfs recv onto
 the
 new external pool.
 Sometimes this fails, but this time it completed.

 That's the key bit for me - zfs send /receive should not just fail at
 random.  It sounds like your problem is not just that you can't export the
 pool.

It's equally flaky in virtual environments, in my experience.  Sadly. 
Send / receive seems to not be ready for prime time yet.  (I had to give
up on incremental completely, since that was erroring out.)

 As Richard says, that sounds like bad hardware / drivers.  Something is
 causing problems for ZFS.

Always possible.  I paid twice or so what I usually pay for systems to buy
this one, including ECC memory and such, but none of that is any guarantee
that I got it right.  Frustrating, though.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Casper . Dik

no - mv doesn't know about zpools, only about posix filesystems.

mv doesn't care about filesystems, only about the interface provided by
POSIX.

There is no ZFS-specific interface that allows you to move a file from
one ZFS filesystem to the next.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool destroy -f hangs system, now zpool import hangs system.

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 7:54 AM, Carl Rathman wrote:


I didn't mean to destroy the pool.  I used zpool destroy on a zvol,
when I should have used zfs destroy.

When I used zpool destroy -f mypool/myvolume the machine hard locked
after about 20 minutes.


This would be a bug.  zpool destroy should only destroy pools.
Volumes are datasets and are destroyed by zfs destroy.  Using
zpool destroy -f will attempt to force unmounts of any mounted
datasets, but volumes are not mounted, per se. Upon reboot, nothing
will be mounted until after the pool is imported.



I don't want to destroy the pool, I just wanted to destroy the one
volume. -- Which is why I now want to import the pool itself. Does
that make sense?


If the pool was destroyed, then you can try to import using -D.

Are you sure you didn't zfs destroy instead?  Once the pool is  
imported,

zpool history will show all of the commands issued against the pool.
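
A sketch of that sequence, using the pool name from the original post (only
applicable if the pool really was destroyed):

zpool import -D             # list destroyed pools that are still importable
zpool import -D storage     # attempt to re-import the destroyed pool by name
zpool history storage       # then review what was actually run against it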
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread David Dyer-Bennet

On Tue, January 5, 2010 10:01, Richard Elling wrote:
 OTOH, if you can reboot you can also run the latest
 b130 livecd which has faster stat().

How much faster is it?  He estimated 250 days to rm -rf them; so 10x
faster would get that down to 25 days, 100x would get it down to 2.5 days
(assuming the entire time is in the stat calls, which is probably not
totally true)

It's interesting how our ability to build larger disks, and our software's
ability to do things like create really large numbers of files, comes back
to bite us on the ass every now and then.

I hope he has a background process running chipping away at it; I don't
THINK 250 days in the background is going to turn out to be the best
answer, but one might as well start the clock running just in case.

Best answer might turn out to be to copy off the less than 20% good data
and just scrag the pool.  Inelegant, but might result in less downtime, or
in getting the space back much faster.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 10 and ZFS dedupe status

2010-01-05 Thread Henrik Johansson

On Jan 5, 2010, at 4:38 PM, Bob Friesenhahn wrote:

 On Mon, 4 Jan 2010, Tony Russell wrote:
 
 I am under the impression that dedupe is still only in OpenSolaris and that 
 support for dedupe is limited or non existent.  Is this true?  I would like 
 to use ZFS and the dedupe capability to store multiple virtual machine 
 images.  The problem is that this will be in a production environment and 
 would probably call for Solaris 10 instead of OpenSolaris.  Are my 
 statements on this valid or am I off track?
 
 If dedup gets scheduled for Solaris 10 (I don't know), it would surely not be 
 available until at least a year from now.
 
 Dedup in OpenSolaris still seems risky to use other than for experimental 
 purposes.  It has only recently become available.

I've just written an entry about update 9; I think it will contain zpool
version 19, so no dedup for this release, if that's correct.

Regards

Henrik
http://sparcv9.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz stripe size (not stripe width)

2010-01-05 Thread Kjetil Torgrim Homme
Brad bene...@yahoo.com writes:

 Hi Adam,

I'm not Adam, but I'll take a stab at it anyway.

BTW, your crossposting is a bit confusing to follow, at least when using
gmane.org.  I think it is better to stick to one mailing list anyway?

 From your the picture, it looks like the data is distributed evenly
 (with the exception of parity) across each spindle then wrapping
 around again (final 4K) - is this one single write operation or two?

It is a single write operation per device. Actually, it may be less than
one write operation, since the transaction group, which probably contains
many more updates, is written as a whole.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 8:13 AM, David Dyer-Bennet wrote:


On Tue, January 5, 2010 10:01, Richard Elling wrote:

OTOH, if you can reboot you can also run the latest
b130 livecd which has faster stat().


How much faster is it?  He estimated 250 days to rm -rf them; so 10x
faster would get that down to 25 days, 100x would get it down to 2.5 days
(assuming the entire time is in the stat calls, which is probably not
totally true)


dunno, nothing useful in the public bug report :-(
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6775100

It's interesting how our ability to build larger disks, and our software's
ability to do things like create really large numbers of files, comes back
to bite us on the ass every now and then.


Wait until you try it with dedup... not only will you need to update a lot
of metadata, but also a lot of DDT entries.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool destroy -f hangs system, now zpool import hangs system.

2010-01-05 Thread Carl Rathman
On Tue, Jan 5, 2010 at 10:12 AM, Richard Elling
richard.ell...@gmail.com wrote:
 On Jan 5, 2010, at 7:54 AM, Carl Rathman wrote:

 I didn't mean to destroy the pool.  I used zpool destroy on a zvol,
 when I should have used zfs destroy.

 When I used zpool destroy -f mypool/myvolume the machine hard locked
 after about 20 minutes.

 This would be a bug.  zpool destroy should only destroy pools.
 Volumes are datasets and are destroyed by zfs destroy.  Using
 zpool destroy -f will attempt to force unmounts of any mounted
 datasets, but volumes are not mounted, per se. Upon reboot, nothing
 will be mounted until after the pool is imported.


 I don't want to destroy the pool, I just wanted to destroy the one
 volume. -- Which is why I now want to import the pool itself. Does
 that make sense?

 If the pool was destroyed, then you can try to import using -D.

 Are you sure you didn't zfs destroy instead?  Once the pool is imported,
 zpool history will show all of the commands issued against the pool.
  -- richard



Hi Richard,

If I could import the pool, I'd love to do a history on it.

At this point, if I attempt to import the pool, the machine will have
heavy disk activity on the pool for approximately 10 minutes, then the
machine will hard lock. This will happen when I boot the machine from
its snv_130 rpool, or if I boot the machine from a snv_130 live cd.

Thanks,
Carl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recovering ZFS stops after syseventconfd can't fork

2010-01-05 Thread Cindy Swearingen

Hi Paul,

I opened 6914208 to cover the sysevent/zfsdle problem.

If the system crashed due to a power failure and the disk labels for
this pool were corrupted, then I think you will need to follow the steps 
to get the disks relabeled correctly. You might review some previous 
postings by Victor Latuskin that describe these steps.


Thanks,

Cindy

On 12/28/09 11:17, Paul Armstrong wrote:

Alas, even moving the file out of the way and rebooting the box (to guarantee 
state) didn't work:

-bash-4.0# zpool import -nfFX hds1
echo $?
-bash-4.0# echo $?
1

Do you need to be able to read all the labels for each disk in the array in 
order to recover?


From zdb -l on one of the disks:


LABEL 3

failed to unpack label 3

Thanks,
Paul

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Tim Cook
On Tue, Jan 5, 2010 at 11:25 AM, Richard Elling richard.ell...@gmail.comwrote:

 On Jan 5, 2010, at 8:13 AM, David Dyer-Bennet wrote:


 On Tue, January 5, 2010 10:01, Richard Elling wrote:

 OTOH, if you can reboot you can also run the latest
 b130 livecd which has faster stat().


 How much faster is it?  He estimated 250 days to rm -rf them; so 10x
 faster would get that down to 25 days, 100x would get it down to 2.5 days
 (assuming the entire time is in the stat calls, which is probably not
 totally true)


 dunno, nothing useful in the public bug report :-(
 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6775100


  It's interesting how our ability to build larger disks, and our software's
 ability to do things like create really large numbers of files, comes back
 to bite us on the ass every now and then.


 Wait until you try it with dedup... not only will you need to update a lot
 of metadata, but also a lot of DTT entries.
  -- richard



I recall pointing this out over a year ago when I said claiming unlimited
snapshots and filesystems was disingenuous at best, and that likely we'd
need to see artificial limitations to make many of these features usable.
But I digress :)

-- 
--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 16:00, Roch wrote:

That said, I truly am for a evolution for random read
workloads. Raid-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored with good random read
performance while large objects are stored efficiently.

   


Have you got any benchmarks available (comparing 512B to 4K to classical 
RAID-5)?


The problem is that while RAID-Z is really good for some workloads, it is
really bad for others. Sometimes an L2ARC can effectively mitigate the
problem, but for some workloads it won't (due to the huge size of the working
set). In such environments RAID-Z2 offers much worse performance than a
similarly configured NetApp (RAID-DP, same number of disk drives). If ZFS
provided another RAID-5/RAID-6-like protection scheme with different
characteristics, so that writing to the pool were slower but reading from it
much faster (comparable to RAID-DP), some customers would be very happy. Then
maybe a new kind of cache device would be needed to buffer writes to NV
storage to make writes faster (as HW arrays have been doing for years).



A possible *workaround* is to use SVM to set-up RAID-5 and create a zfs 
pool on top of it.
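
For concreteness, a very rough sketch of that layering (device names are
placeholders, and the usual SVM caveats about state database replicas and the
lack of RAID-6 apply; note too that ZFS on a single metadevice can detect but
not repair checksum errors unless copies is raised):

metadb -a -f -c 2 c0t0d0s7 c0t1d0s7                           # state database replicas
metainit d20 -r c1t1d0s0 c1t2d0s0 c1t3d0s0 c1t4d0s0 -i 64k    # SVM RAID-5 metadevice
zpool create tank /dev/md/dsk/d20                             # pool on the metadevice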

How does SVM handle R5 write hole? IIRC SVM doesn't offer RAID-6.

--
Robert Milkowski
http://milek.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Daniel Rock

On 05.01.2010 16:22, Mikko Lammi wrote:

However when we deleted some other files from the volume and managed to
raise free disk space from 4 GB to 10 GB, the rm -rf directory method
started to perform significantly faster. Now it's deleting around 4,000
files/minute (240,000/h - quite an improvement from 10,000/h). I remember
that I saw some discussion related to ZFS performance when filesystem
becomes very full, so I wonder if that was the case here.


I did some tests on an Ultra 20 (2.2 GHz dual-core Opteron) with crappy SATA
disks. On this machine, creation and deletion of files were I/O bound. I was
able to create about 1 million files per hour. I stopped after 5 hours, so I
had approximately 5 million files in one directory.

Deletion (via the Perl script) also ran at a rate of about 1 million files
per hour. During deletion both disks (a mirrored zpool) were 95% busy; CPU
time was 5% in total.


If the T1000 has SCSI disks you can turn on write cache on both disks 
(though in my tests on delete most I/O were read operations). For the rpool 
it will probably not be enabled by default because you are just using 
partitions:


# format -e
[select disk]
format> scsi
scsi> p8 b2 |= 4

Mode select on page 8 ok.

scsi> quit

Disable write cache:

scsi> p8 b2 &= ~4


(Yes I know, there is a cache command in format, but I was used to the above
commands a long time before the cache command was introduced)


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Joe Blount

On 01/ 5/10 10:01 AM, Richard Elling wrote:

How are the files named?  If you know something about the filename
pattern, then you could create subdirs and mv large numbers of files
to reduce the overall size of a single directory.  Something like:

mkdir .A
mv A* .A
mkdir .B
mv B* .B
...

I doubt that would be a faster option, unless you can be certain the 
file naming coincides with the unsorted order of the files in the 
directory.  If A* does not occur at the beginning of the 
directory's contents, finding those files will be painful.  The above process 
would add many cycles of scanning through all 60 million directory 
entries, and each move request will churn the vnode cache.



A while back I did some experimenting with millions of files per 
directory.  Note that the time estimates are overstating how long it 
will take.  The more files you remove, the faster it will go.  I would 
be trying to get an unsorted read of the directory, and delete them in 
that order.  This is not just to save the time it takes to sort the 
output.  It will also minimize vnode cache churn, and the time to remove 
each object.  Each remove request must iterate the directory looking for 
the object to remove. 

Newer ON builds support the -U option to ls, for unsorted output.  I 
don't know what may exist on S10.  FWIW, I copied the 'ls' binary from an 
ON128 machine to /tmp/myls on a S10 machine, and it appeared to work - I 
don't know if there are any issues/risks with doing that.



Since it's a Niagara system, it might go faster if you can get multiple 
removes going in parallel - but only if all the parallel remove 
requests can be on files near the beginning of the directory's 
contents.  If you can't get an unsorted list of files, then multiple 
threads will just add to the vnode cache thrashing.



It might be worth trying something like this:
ls -U > remove.sh
Make it a bash script:
prepend rm -f to each line, and append an & to each line.
Maybe every few hundred lines put in a wait.  (In case the rm's can be 
kicked off significantly faster than they can be completed, you don't 
want millions of rm's to get started.)


You'll have to wait on ls to do one unsorted read of the directory.  
Then you will get parallel remove requests going, and always on files at 
the beginning of the directory.  There should be minimal vnode churn 
during the removes.
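
Roughly, that generation step could look like this (the paths and the batch
size of 200 are just examples; /tmp/myls is the copied ls from above):

# Turn an unsorted listing into a script of backgrounded rm's, with a
# wait after every 200 so that millions of rm processes never pile up.
cd /path/to/bigdir && /tmp/myls -U . | \
  awk '{ printf "rm -f \"%s\" &\n", $0;
         if (NR % 200 == 0) print "wait" }
       END { print "wait" }' > /var/tmp/remove.sh
sh /var/tmp/remove.sh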


Starting the new processes for the removes may counteract the benefit of 
parallelizing, and make this slower.  But since it's a Niagara system, 
you may have the spare cpu cycles to waste anyway.  It's just another 
idea to try...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Paul Gress

On 01/ 5/10 05:34 AM, Mikko Lammi wrote:

Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. Good thing
about ZFS is that it allows it without any issues. Unfortunately now that
we need to get rid of them (because they eat 80% of disk space) it seems
to be quite challenging.
   


I've been following this thread.  Would it be faster to do the reverse:  
copy the 20% of the disk that's in use, then format, then move the 20% back?


Paul
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Michael Schuster

Paul Gress wrote:

On 01/ 5/10 05:34 AM, Mikko Lammi wrote:

Hello,

As a result of one badly designed application running loose for some time,
we now seem to have over 60 million files in one directory. Good thing
about ZFS is that it allows it without any issues. Unfortunately now that
we need to get rid of them (because they eat 80% of disk space) it seems
to be quite challenging.
  


I've been following this thread.  Would it be faster to do the reverse.  
Copy the 20% of disk then format then move the 20% back.


I'm not sure the OS installation would survive that.

Michael
--
Michael Schusterhttp://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Fajar A. Nugraha
On Wed, Jan 6, 2010 at 12:44 AM, Michael Schuster
michael.schus...@sun.com wrote:
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.


 I've been following this thread.  Would it be faster to do the reverse.
  Copy the 20% of disk then format then move the 20% back.

 I'm not sure the OS installation would survive that.

... even when done from a live/rescue CD session?

-- 
Fajar
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import -f not forceful enough?

2010-01-05 Thread Cindy Swearingen

Hi Dan,

Can you describe what you are trying to recover from with more details
because we can't quite follow what steps might have lead to this
scenario.

For example, your hdc pool had two disks, c1t0d0s0 and c8t1d0s0,
and your rpool is on c8t0d0s0, so c8t0d0s0 cannot be wiped clean.

Maybe you mean c8t1d0s0 is now relabeled (with labelfix) and
reconnected for hdc but c1t0d0s0 is still corrupted so the hdc pool
cannot be re-imported is my guess...

Thanks,

Cindy


On 01/03/10 14:49, Dan McDonald wrote:

I had to use the labelfix hack (and I had to recompile it at that) on 1/2 of an 
old zpool.  I made this change:

/* zio_checksum(ZIO_CHECKSUM_LABEL, zc, buf, size); */
zio_checksum_table[ZIO_CHECKSUM_LABEL].ci_func[0](buf, size, zc);

and I'm assuming [0] is the correct endianness, since afterwards I saw it come up with 
zpool import.

Unfortunately, I can't import it.  Here's what happens:

# uname -a
SunOS neuromancer 5.11 snv_130 i86pc i386 i86pc
# zpool import
  pool: hdc
id: 18323387294498987089
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

hdc   FAULTED  corrupted data
  mirror-0DEGRADED
c1t0d0s0  FAULTED  corrupted data
c8t1d0s0  ONLINE
# zpool import -f hdc
cannot import 'hdc': one or more devices is currently unavailable
Destroy and re-create the pool from
a backup source.
# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
rpool   ONLINE   0 0 0
  c8t0d0s0  ONLINE   0 0 0

errors: No known data errors
#

Note that c1t0d0s0 was on the old system, and that it's now the (wiped clean) 
c8t0d0s0.  Any clues are, as always, welcome.  I'd prefer not to restore my 
saved zfs-send streams, so I'd like to get the import of the old root pool 
(hdc) to work.

Thanks!
Dan McD.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import -f not forceful enough?

2010-01-05 Thread Dan McDonald
 Hi Dan,
 
 Can you describe what you are trying to recover from
 with more details
 because we can't quite follow what steps might have
 lead to this
 scenario.

Sorry.

I was running Nevada 103 with a root zpool called hdc with c1t0d0s0 and 
c1t1d0s0.

I first uttered:  zpool detach hdc c1t1d0s0.

I then detached that drive, installed OpenSolaris on what *was* c1t0d0s0.  Once 
I rebooted into OpenSolaris, I noticed that the drive had become c8t0d0s0.

Assuming the same remapping happened to the other drive, I plugged it back in, 
ran labelfix on c8t1d0s0, and now got to see that pool hdc assumes the mirror 
is c1t0d0s0 and c8t1d0s0.  I have no idea what c1t0d0s0 points to on the new 
OpenSolaris view of things, but it is definitely corrupt from the point of view 
of a ZFS mirrored pool.

So it sounds like I cannot whack c8t1d0s0 to be a pool all by itself and I 
should just give up.  Is that correct?

Thanks,
Dan
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread A Darren Dunham
On Tue, Jan 05, 2010 at 04:49:00PM +, Robert Milkowski wrote:
 A possible *workaround* is to use SVM to set-up RAID-5 and create a
 zfs pool on top of it.
 How does SVM handle R5 write hole? IIRC SVM doesn't offer RAID-6.

As far as I know, it does not address it.  It's possible that adding a
transaction volume would help by replaying anything that affected the
volume, but I don't know that sufficient information is present.

Symantec Volume Manager offers an explicit Raid5 log device.  There
doesn't appear to be any corresponding object in SVM.

-- 
Darren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Roch Bourbonnais


Le 5 janv. 10 à 17:49, Robert Milkowski a écrit :


On 05/01/2010 16:00, Roch wrote:

That said, I truly am for a evolution for random read
workloads. Raid-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored with good random read
performance while large objects are stored efficiently.




Have you got any benchmarks available (comparing 512B to 4K to  
classical RAID-5)?


Using an 8K 'soft' sector prototype on an otherwise plain raid-z layout, 
we got 8X more random reads than with 512B sectors, as would be 
expected.




The problem is that while RAID-Z is really good for some workloads  
it is really bad for others.


The bigger sector makes raid-z behave like mirroring for small records, so 
the performance of raid-z will be very good, and it is also space 
efficient for large objects.


Sometimes having L2ARC might effectively mitigate the problem but  
for some workloads it won't (due to the huge size of a working set).  
In such environments RAID-Z2 offers much worse performance then  
similarly configured NetApp (RAID-DP, same number of disk drives).  
If ZFS would provide another RAID-5/RAID-6 like protection but with  
different characteristics so writing to a pool would be slower but  
reading from it would be much faster (comparable to RAID-DP) some  
customers would be very happy.


Agreed.

Then maybe a new kind of cache device would be needed to buffer  
writes to NV storage to make writes faster (like HW arrays have  
been doing for years).




Writes are not the problem, and we have log devices to offload them.
It's really about maintaining the integrity of a raid-5 type layout in the
presence of bit-rot, even if such bit-rot occurs within free space.



A possible *workaround* is to use SVM to set-up RAID-5 and create a  
zfs pool on top of it.

How does SVM handle R5 write hole? IIRC SVM doesn't offer RAID-6.



It doesn't.



--
Robert Milkowski
http://milek.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 8:49 AM, Robert Milkowski wrote:


On 05/01/2010 16:00, Roch wrote:

That said, I truly am for a evolution for random read
workloads. Raid-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored with good random read
performance while large objects are stored efficiently.




Have you got any benchmarks available (comparing 512B to 4K to  
classical RAID-5)?


Not fair!  A 512 byte random write workload will absolutely clobber a
RAID-5 implementation. It is the RAID-5 pathological worst case.
For many arrays, even a 4 KB random write workload will suck most
heinously.

The raidz pathological worst case is a random read from many-column
raidz where files have records 128 KB in size.  The inflated read problem
is why it makes sense to match recordsize for fixed record workloads.
This includes CIFS workloads which use 4 KB records. It is also why
having many columns in the raidz for large records does not improve
performance. Hence the 3 to 9 raidz disk limit recommendation in the
zpool man page.
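
For example, for a fixed 4 KB record workload you would set the recordsize
on the dataset before populating it (the dataset name is only an example):

# Match recordsize to the application's record size up front; the
# property only takes effect for files created after it is set.
zfs set recordsize=4k tank/cifs
zfs get recordsize tank/cifs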

http://www.baarf.com

The problem is that while RAID-Z is really good for some workloads  
it is really bad for others.
Sometimes having L2ARC might effectively mitigate the problem but  
for some workloads it won't (due to the huge size of a working set).  
In such environments RAID-Z2 offers much worse performance then  
similarly configured NetApp (RAID-DP, same number of disk drives).  
If ZFS would provide another RAID-5/RAID-6 like protection but with  
different characteristics so writing to a pool would be slower but  
reading from it would be much faster (comparable to RAID-DP) some  
customers would be very happy. Then maybe a new kind of cache device  
would be needed to buffer writes to NV storage to make writes faster  
(like HW arrays have been doing for years).


This still does not address the record checksum.  This is only a problem
for small, random read workloads, which means L2ARC is a good solution.
If L2ARC is a set of HDDs, then you could gain some advantage, but IMHO
HDD and good performance do not belong in the same sentence anymore.
Game over -- SSDs win.

A possible *workaround* is to use SVM to set-up RAID-5 and create a  
zfs pool on top of it.

How does SVM handle R5 write hole? IIRC SVM doesn't offer RAID-6.


IIRC, SVM does a prewrite.  Dog slow.  Also, SVM is, AFAICT, on life support.
The source is out there if anyone wants to carry it forward. Actually, many of us
would be quite happy for SVM to fade from our memory :-)
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz stripe size (not stripe width)

2010-01-05 Thread Richard Elling


On Jan 4, 2010, at 7:08 PM, Brad wrote:


Hi Adam,

From your the picture, it looks like the data is distributed evenly  
(with the exception of parity) across each spindle then wrapping  
around again (final 4K) - is this one single write operation or two?


| P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 | - 
one write op??
| P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 | - 
one write op??


One physical write op per vdev because the columns will likely be
coalesced at the vdev.  Obviously, one physical write cannot span
multiple vdevs.


For a stripe configuration, is this would it would like look for 8K?

| D00 D01 D02 D03 D04 D05 D06 D07 D08 |
| D09 D10 D11 D12 D13 D14 D15 D16 D17 |


No.  It is very likely the entire write will be to one vdev.  Again, this is
dynamic striping, not RAID-0. RAID-0 is defined by SNIA as "A disk array
data mapping technique in which fixed-length sequences of virtual disk
data addresses are mapped to sequences of member disk addresses
in a regular rotating pattern."  In ZFS, there is no fixed-length sequence.

The next column is chosen approximately every MB or so. You get the
benefit of sequential access to the media, with the stochastic spreading
across vdevs as well.

When you have multiple top-level vdevs, such as multiple mirrors or
multiple raidz sets, then you get the ~ 1MB spread across the top level
and the normal allocations in the sets.  In other words, any given record
should be in one set.  Again, this limits hyperspreading and allows you
to scale to very large numbers of disks.  It seems to work reasonably
well in practice. I attempted to describe this in pictures for my ZFS
tutorials.  You can be the judge, and suggestions are always welcome.
See slide 27 at
http://www.slideshare.net/relling/zfs-tutorial-usenix-lisa09-conference
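
As a sketch, a pool with two top-level raidz vdevs looks like this
(device names are placeholders):

# Two top-level raidz vdevs; allocations spread across both sets in
# roughly 1MB runs, and any given record stays within one set.
zpool create tank \
    raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
    raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0
zpool status tank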

[for the alias, I've only today succeeded in uploading the slides to
slideshare... been trying off and on for more than a month :-(]
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Richard Elling


On Jan 5, 2010, at 8:52 AM, Daniel Rock wrote:


Am 05.01.2010 16:22, schrieb Mikko Lammi:
However when we deleted some other files from the volume and  
managed to
raise free disk space from 4 GB to 10 GB, the rm -rf directory  
method
started to perform significantly faster. Now it's deleting around  
4,000
files/minute (240,000/h - quite an improvement from 10,000/h). I  
remember

that I saw some discussion related to ZFS performance when filesystem
becomes very full, so I wonder if that was the case here.


I did some tests. They were done on an Ultra 20 (2.2 GHz Dual-Core  
Opteron) with crappy SATA disks. On this machine creation and  
deletion of files were I/O bound. I was able to create about 1 Mio.  
files per hour. I stopped after 5 hours, so I had approx. 5 Mio.  
files in one directory.


Deletion (via the Perl script) also had a rate of ~1 Mio. files per  
hour. During deletion the disks (mirrored zpool) were both 95% busy,  
CPU time was 5% total.


If the T1000 has SCSI disks you can turn on write cache on both  
disks (though in my tests on delete most I/O were read operations).  
For the rpool it will probably not be enabled by default because  
your are just using partitions:


Good observation!  By default, rpool will not have write cache enabled.
It might make a difference to enable the write cache for this operation.
 -- richard



# format -e
[select disk]
format> scsi
scsi> p8 b2 |= 4

Mode select on page 8 ok.

scsi> quit

Disable write cache:

scsi> p8 b2 &= ~4


(Yes I know, there is a cache command in format, but I'm used to  
above

commands a long time before the cache command was introduced)


Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 18:37, Roch Bourbonnais wrote:


Writes are not the problem and we have log device to offload them. 
It's really about maintaining integrity of raid-5 type layout in the 
presence of bit-rot even if such

bit-rot occur within free space.



How is it addressed in RAID-DP?


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 18:49, Richard Elling wrote:
On Jan 5, 2010, at 8:49 AM, Robert Milkowski wrote: 


The problem is that while RAID-Z is really good for some workloads it 
is really bad for others.
Sometimes having L2ARC might effectively mitigate the problem but for 
some workloads it won't (due to the huge size of a working set). In 
such environments RAID-Z2 offers much worse performance then 
similarly configured NetApp (RAID-DP, same number of disk drives). If 
ZFS would provide another RAID-5/RAID-6 like protection but with 
different characteristics so writing to a pool would be slower but 
reading from it would be much faster (comparable to RAID-DP) some 
customers would be very happy. Then maybe a new kind of cache device 
would be needed to buffer writes to NV storage to make writes faster 
(like HW arrays have been doing for years).


This still does not address the record checksum.  This is only a problem
for small, random read workloads, which means L2ARC is a good solution.
If L2ARC is a set of HDDs, then you could gain some advantage, but IMHO
HDD and good performance do not belong in the same sentence anymore.
Game over -- SSDs win.



as I wrote - sometimes the working set is so big that, L2ARC or not, there 
is virtually no difference, and it is not practical to deploy an L2ARC 
several TBs in size or bigger. For such a workload RAID-DP behaves much 
better (many small random reads, not that many writes).



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread David Dyer-Bennet

On Tue, January 5, 2010 10:25, Richard Elling wrote:
 On Jan 5, 2010, at 8:13 AM, David Dyer-Bennet wrote:

 It's interesting how our ability to build larger disks, and our
 software's
 ability to do things like create really large numbers of files,
 comes back
 to bite us on the ass every now and then.

 Wait until you try it with dedup... not only will you need to update a
 lot
 of metadata, but also a lot of DTT entries.

My data consists (by volume) almost entirely of bitmap photo images; I
don't think dedup is going to buy me much, so I'm not leaping into
experimenting with it.

Probably just as well; I don't think I have enough memory for it, either.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Tristan Ball

On 6/01/2010 3:00 AM, Roch wrote:

Richard Elling writes:
On Jan 3, 2010, at 11:27 PM, matthew patton wrote:
  
  I find it baffling that RaidZ(2,3) was designed to split a record-
  size block into N (N=# of member devices) pieces and send the
  uselessly tiny requests to spinning rust when we know the massive
  delays entailed in head seeks and rotational delay. The ZFS-mirror
  and load-balanced configuration do the obviously correct thing and
  don't split records and gain more by utilizing parallel access. I
  can't imagine the code-path for RAIDZ would be so hard to fix.
  
Knock yourself out :-)

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c
  
  I've read posts back to 06 and all I see are lamenting about the
  horrendous drop in IOPs, about sizing RAIDZ to ~4+P and trying to
  claw back performance by combining multiple such vDEVs. I understand
  RAIDZ will never equal Mirroring, but it could get damn close if it
  didn't break requests down and better yet utilized copies=N and
  properly placed the copies on disparate spindles. This is somewhat
  analogous to what the likes of 3PAR do and it's not rocket science.
  

   

[snipped for space ]


That said, I truly am for a evolution for random read
workloads. Raid-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored with good random read
performance while large objects are stored efficiently.

-r
   

Sold! Let's do that then! :-)

Seriously - are there design or architectural reasons why this isn't 
done by default, or at least an option? Or is it just a "no one's had 
time to implement yet" thing?
I understand that 4K sectors might be less space efficient for lots of 
small files, but I suspect lots of us would happily make that trade off!


Thanks,
Tristan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 11:30 AM, Robert Milkowski wrote:

On 05/01/2010 18:49, Richard Elling wrote:

On Jan 5, 2010, at 8:49 AM, Robert Milkowski wrote:


The problem is that while RAID-Z is really good for some workloads  
it is really bad for others.
Sometimes having L2ARC might effectively mitigate the problem but  
for some workloads it won't (due to the huge size of a working  
set). In such environments RAID-Z2 offers much worse performance  
then similarly configured NetApp (RAID-DP, same number of disk  
drives). If ZFS would provide another RAID-5/RAID-6 like  
protection but with different characteristics so writing to a pool  
would be slower but reading from it would be much faster  
(comparable to RAID-DP) some customers would be very happy. Then  
maybe a new kind of cache device would be needed to buffer writes  
to NV storage to make writes faster (like HW arrays have been  
doing for years).


This still does not address the record checksum.  This is only a  
problem
for small, random read workloads, which means L2ARC is a good  
solution.
If L2ARC is a set of HDDs, then you could gain some advantage, but  
IMHO

HDD and good performance do not belong in the same sentence anymore.
Game over -- SSDs win.



as I wrote - sometimes the working set is so big that L2ARC or not  
there is virtually no difference and it is not practical to deploy  
L2ARC several TBs in size or bigger. For such workload RAID-DP  
behaves much better (many small random reads, not that much writes).


If you are doing small, random reads on dozens of TB of data, then you've
got a much bigger problem on your hands... kinda like counting grains of
sand on the beach during low tide :-).  Hopefully, you do not have to randomly
update that data because your file system isn't COW :-). Fortunately, most
workloads are not of that size and scope.

Since there are already 1 TB SSDs on the market, the only thing keeping the
HDD market alive is the low $/TB.  Moore's Law predicts that cost advantage
will pass.  SSDs are already the low $/IOPS winners.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Richard Elling

On Jan 5, 2010, at 11:56 AM, Tristan Ball wrote:

On 6/01/2010 3:00 AM, Roch wrote:

That said, I truly am for a evolution for random read
workloads. Raid-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored with good random read
performance while large objects are stored efficiently.

-r


Sold! Let's do that then! :-)

Seriously - are there design or architectural reasons why this isn't  
done by default, or at least an option? Or is it just a no one's  
had time to implement yet thing?


Waiting on hardware to become widely available might be a long wait.
See also PSARC 2008/769
http://arc.opensolaris.org/caselog/PSARC/2008/769/inception.materials/design_doc

I understand that 4K sectors might be less space efficient for lots  
of small files, but I suspect lots of us would happilly make that  
trade off!


+1 (for better reliability, too!)
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Bob Friesenhahn

On Tue, 5 Jan 2010, Richard Elling wrote:


Since there are already 1 TB SSDs on the market, the only thing keeping the
HDD market alive is the low $/TB.  Moore's Law predicts that cost advantage
will pass.  SSDs are already the low $/IOPS winners.


SSD vendors are still working to stabilize their designs.  Most of 
them seem to be unworthy for use in more than a laptop computer.  A 
number of computer vendors (e.g. Apple & Dell) who offered SSDs in 
their computers encountered an unexpectedly high rate of product 
failure.


According to Sun's own engineers, Moore's Law is very bad for 
enterprise SSDs.  FLASH devices built to very small geometries are 
more likely to wear out and forget.  Current design trends are moving 
in a direction which is contrary to the requirements of enterprise 
SSDs. See


  http://www.eetimes.com/showArticle.jhtml?articleID=219200284

Perhaps innovative designers like Suncast will figure out how to build 
reliable SSDs based on parts which are more likely to wear out and 
forget.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Tristan Ball



On 6/01/2010 7:19 AM, Richard Elling wrote:


If you are doing small, random reads on dozens of TB of data, then you've
got a much bigger problem on your hands... kinda like counting grains of
sand on the beach during low tide :-).  Hopefully, you do not have to 
randomly
update that data because your file system isn't COW :-). Fortunately, 
most

workloads are not of that size and scope.

Since there are already 1 TB SSDs on the market, the only thing 
keeping the
HDD market alive is the low $/TB.  Moore's Law predicts that cost 
advantage

will pass.  SSDs are already the low $/IOPS winners.
 -- richard

These workloads (small random reads over huge datasets) might be getting 
more common in some environments - because it seems to be what you get 
when you consolidate virtual machine storage.


We've got a moderately large number of Virtual Machines (a mix of 
Debian, Win2K & Win2K3) running a very large set of applications, and our 
reads are all over the place! :-( I have to say I remain impressed at 
how well the ARC behaves, but even then our hit rate is often not 
wonderful.


I _dream_ about being able to afford to build out my entire storage from 
cheap/large SSD's. My guess is that in about 2 years I'll be able to, 
which is one of the reasons we've essentially put a hold on buying 
enterprise storage or fast FC/SCSI disks. A large part of the 
justification for FC/SCSI disks is their performance, and they're going 
to be completely eclipsed within the lifetime of any serious mid-range 
to high-end storage array. Until that day we'll make do with large sata 
drives, mirrored, with relatively high spindle counts to avoid long 
per-disk queues.


:-)

T

PS:  OK, I know other tier-1 storage vendors have started integrating 
SSD's as well, but they hadn't when we started our current round of 
storage upgrades, and I still think opensolaris+sata hdds+ssd's gives 
us a cleaner, cheaper and easier upgrade path than most tier-1 vendors 
can provide.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] (Practical) limit on the number of snapshots?

2010-01-05 Thread Juergen Nickelsen
Is there any limit on the number of snapshots in a file system?

The documentation -- manual page, admin guide, troubleshooting guide
-- does not mention any. That seems to confirm my assumption that it
is probably not a fixed limit, but there may still be a practical
one, just like there is no limit on the number of file systems in a
pool, but nobody would find having a million file systems practical.

I have tried to create a number of snapshots in a file system for a
few hours. An otherwise unloaded X4250 with a nearly empty RAID-Z2
pool of six builtin disks (146 GB, 10K rpm) managed to create a few
snapshots per second in an empty file system.
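
For reference, a loop of roughly this shape is all such a test needs
(the pool/file system name and the count are placeholders):

# Create snapshots in a tight loop to probe for a practical limit.
i=0
while [ $i -lt 40000 ]; do
    zfs snapshot tank/testfs@stress-$i
    i=$((i + 1))
done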

It had not visibly slowed down when it reached 36051 snapshots after
hours and I stopped it; to my surprise destroying the file system
(with all these snapshots in it) took about as long. With ``iostat
-xn 1'' I could see that the disk usage was still low, at about 13%
IIRC.

So 36000 snapshots in an empty file system is not a problem. Is it
different with a file system that is, say, 70% full? Or on a
bigger pool? Or with a significantly larger number of snapshots,
say, a million? I am asking for real experience here, not for the
theory.

Regards, Juergen.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 20:19, Richard Elling wrote:

On Jan 5, 2010, at 11:30 AM, Robert Milkowski wrote:

On 05/01/2010 18:49, Richard Elling wrote:

On Jan 5, 2010, at 8:49 AM, Robert Milkowski wrote:


The problem is that while RAID-Z is really good for some workloads 
it is really bad for others.
Sometimes having L2ARC might effectively mitigate the problem but 
for some workloads it won't (due to the huge size of a working 
set). In such environments RAID-Z2 offers much worse performance 
then similarly configured NetApp (RAID-DP, same number of disk 
drives). If ZFS would provide another RAID-5/RAID-6 like protection 
but with different characteristics so writing to a pool would be 
slower but reading from it would be much faster (comparable to 
RAID-DP) some customers would be very happy. Then maybe a new kind 
of cache device would be needed to buffer writes to NV storage to 
make writes faster (like HW arrays have been doing for years).


This still does not address the record checksum.  This is only a 
problem

for small, random read workloads, which means L2ARC is a good solution.
If L2ARC is a set of HDDs, then you could gain some advantage, but IMHO
HDD and good performance do not belong in the same sentence anymore.
Game over -- SSDs win.



as I wrote - sometimes the working set is so big that L2ARC or not 
there is virtually no difference and it is not practical to deploy 
L2ARC several TBs in size or bigger. For such workload RAID-DP 
behaves much better (many small random reads, not that much writes).


If you are doing small, random reads on dozens of TB of data, then you've
got a much bigger problem on your hands... kinda like counting grains of
sand on the beach during low tide :-).  Hopefully, you do not have to 
randomly
update that data because your file system isn't COW :-). Fortunately, 
most

workloads are not of that size and scope.



Well, nevertheless some environments are like that (and no, I'm not 
speculating), and the truth is that a NetApp with RAID-DP and the same 
number of disk drives proved to be faster than RAID-Z2, even with the help 
of SSDs as L2ARC. The point is that NetApp provided the capacity of 
RAID-6 and the protection of dual parity while delivering better 
performance than RAID-Z2 in that environment.
In other workloads RAID-Z2 will be better, but not in this particular 
environment.


All I'm saying is that having yet another RAID type in ZFS which offers 
capacity similar to RAID-5/RAID-6 but with different performance 
characteristics, so that small random reads are on par with RAID-DP while 
sacrificing write performance, would be beneficial for some environments.


RAID-Z with a bigger sector size could improve performance, but the provided 
capacity could be much less than with RAID-5/6, so it might not necessarily 
be an apples-to-apples comparison (though still useful for some environments).


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 20:19, Richard Elling wrote:

[...] Fortunately, most
workloads are not of that size and scope.



Forgot to mention it in my last email - yes, I agree. The environment 
I'm talking about is rather unusual and in most other cases where 
RAID-5/6 was considered the performance of RAID-Z1/2 was good enough or 
even better.



--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-05 Thread Miles Nordin
 dm == David Magda dma...@ee.ryerson.ca writes:

dm 4096 - to-512 blocks

aiui NAND flash has a minimum write size (determined by ECC OOB bits)
of 2 - 4kB, and a minimum erase size that's much larger.  Remapping
cannot abstract away the performance implication of the minimum write
size if you are doing a series of synchronous writes smaller than the
minimum size on a device with no battery/capacitor, although using a
DRAM+supercap prebuffer might be able to abstract away some of it.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Michael Herf
Many large-scale photo hosts start with netapp as the default good
enough way to handle multiple-TB storage. With a 1-5% cache on top,
the workload is truly random-read over many TBs. But these workloads
almost assume a frontend cache to take care of hot traffic, so L2ARC
is just a nice implementation of that, not a silver bullet.

I agree that RAID-DP is much more scalable for reads than RAIDZx, and
this basically turns into a cost concern at scale.

The raw cost/GB for ZFS is much lower, so even a 3-way mirror could be
used instead of netapp. But this certainly reduces the cost advantage
significantly.

mike

p.s. I managed the team that built blogger.com's photo hosting, and
picasaweb.google.com, so I've seen some of this stuff at scale
(neither of these use netapp). For large photos, it's pretty simple:
the more independent spindles, the better.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-05 Thread Wes Felter

Eric D. Mudama wrote:

On Mon, Jan  4 at 16:43, Wes Felter wrote:

Eric D. Mudama wrote:


I am not convinced that a general purpose CPU, running other software
in parallel, will be able to be timely and responsive enough to
maximize bandwidth in an SSD controller without specialized hardware
support.


Fusion-io would seem to be a counter-example, since it uses a fairly 
simple controller (I guess the controller still performs ECC and maybe 
XOR) and the driver eats a whole x86 core. The result is very high 
performance.


I see what you're saying, but it isn't obvious (to me) how well
they're using all the hardware at hand.  2GB/s of bandwidth over their
PCI-e link and what looks like a TON of NAND, with a nearly-dedicated
x86 core...  resuting in 600MB/s or something like that?


Actually it's 600-700MB/s out of a 1+1GB/s slot or 1.5GB/s with two 
cards in a 2+2GB/s slot. I suspect that's pretty close to the PCIe 
limit. IIRC they have 22 NAND channels at 40MB/s (theoretical peak) 
each, which is 880MB/s. I agree that their CPU efficiency is not great, 
but cores are supposed to be cheap these days.


Wes Felter

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Robert Milkowski

On 05/01/2010 23:31, Michael Herf wrote:

The raw cost/GB for ZFS is much lower, so even a 3-way mirror could be
used instead of netapp. But this certainly reduces the cost advantage
significantly.
   


This is true to some extent. I didn't want to bring it up as I wanted to 
focus only on technical aspect.


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread David Magda

On Jan 5, 2010, at 16:06, Bob Friesenhahn wrote:

Perhaps inovative designers like Suncast will figure out how to  
build reliable SSDs based on parts which are more likely to wear out  
and forget.


At which point we'll probably start seeing the memristor start making  
an appearance in various devices. :)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Practical) limit on the number of snapshots?

2010-01-05 Thread Ian Collins

Juergen Nickelsen wrote:

Is there any limit on the number of snapshots in a file system?

The documentation -- manual page, admin guide, troubleshooting guide
-- does not mention any. That seems to confirm my assumption that is
is probably not a fixed limit, but there may still be a practical
one, just like there is no limit on the number of file systems in a
pool, but nobody would find having a million file systems practical.

I have tried to create a number of snapshots in a file system for a
few hours. An otherwise unloaded X4250 with a nearly empty RAID-Z2
pool of six builtin disks (146 GB, 10K rpm) managed to create a few
snapshots per second in an empty file system.

It had not visibly slowed down when it reached 36051 snapshots after
hours and I stopped it; to my surprise destroying the file system
(with all these snapshots in it) took about as long. With ``iostat
-xn 1'' I could see that the disk usage was still low, at about 13%
IIRC.

So 36000 snapshots in an empty file system is not a problem. Is it
different with a file system that is, say, to 70% full? Or on a
bigger pool? Or with a significantly larger number of snapshots,
say, a million? I am asking for real experience here, not for the
theory.

  
The most I ever had was about 240,000 on a 2TB pool (~1000 filesystems x 
60 days x 4 per day).  There wasn't any noticeable performance impact, 
except when I built a tree of snapshots (via libzfs) to work out which 
ones had to be replicated.


Deleting 50 days worth of them took a very long time!

--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2010-01-05 Thread R.G. Keen
One reason I was so interested in this issue was the double-price of raid 
enabled disks. 

However, I realized that I am doing the initial proving, not production - even 
if personal - of the system I'm building. So for that purpose, an array of 
smaller and cheaper disks might be good. 

In the process of looking at that, I found that geeks.com has Seagate 750GB 
Barracuda ES2 drives for $58 each if you'll put up with them being factory 
recertified and only warrantied for six months. 

Not great, I don't trust refurbished or recertified anything with archival 
data; but it's a test system. So I grabbed six of them for the initial build. 

This gives me a way to compare them against desktop systems in an array. May 
take a while but I can dig some of the issues out.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Tristan Ball
For those searching list archives, the SNV125-S2/40GB given below is not
based on the Intel controller.

I queried Kingston directly about this because there appears to be so
much confusion (and I'm considering using these drives!), and I got back
that:

The V series uses a JMicron Controller
The V+ series uses a Samsung Controller
The M and E series are the Intel Drives

There will be G2 versions of the V and V+ series out shortly, at least
one of which will be based on a Toshiba controller.

Part number prefixes are:

V series: SNV125-S2
V+ Series: SNV225-S2
M Series: SNM225-S2
E Series: SNE125-S2

http://www.kingston.com/anz/ssd/default.asp


It does seem that there are _lots_ of 3rd party websites that claim the
variations of the SNV* parts are Intel based drives, however that's not
what Kingston's rep told me, and it's not what's on their website.

Regards,
Tristan

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Eric D. Mudama
Sent: Tuesday, 5 January 2010 7:35 PM
To: Thomas Burgess
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] need a few suggestions for a poor man's
ZIL/SLOG device

On Mon, Jan  4 at 22:01, Thomas Burgess wrote:
   I guess i got some bad advice then
   I was told the kingston snv125-s2 used almost the exact same
hardware as
   an x25-m and should be considered the poor mans x25-m
...

   Right, i couldn't find any of the 40 gb's in stock so i ordered the
64
   gb.same exact model, only biggerdoes your previous statement
about
   the larger model ssd's not apply to the kingstons?

The SNV125-S2/40GB is the half an X25-M drive which can be often
found as a bare OEM drive for about $85 w/ rebate.

Kingston does sell rebranded Intel SLC drives as well, but under a
different model number: SNE-125S2/32 or SNE-125S2/64.  I don't believe
the 64GB Kingston MLC (SNV-125S2/64) is based on Intel's controller.

The Kingston rebranding of the gen2 intel MLC design is
SNM-125S2B/80 or SNM-125S2B/160.  Those are essentially 34nm Intel
X25-M units I believe.

--eric


-- 
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2010-01-05 Thread Erik Trimble
As a further update, I went back and re-read my SSD controller info, and 
then did some more Googling.


Turns out, I'm about a year behind on State-of-the-SSD.  Eric is 
correct on the way current SSDs implement writes (both SLC and MLC), so 
I'm issuing a mea culpa here. The change in implementation appears to 
occur sometime shortly after the introduction of the Indilinx 
controllers.  My fault for not catching this.


-Erik



Eric D. Mudama wrote:

On Sat, Jan  2 at 22:24, Erik Trimble wrote:
In MLC-style SSDs, you typically have a block size of 2k or 4k. 
However, you have a Page size of several multiples of that, 128k 
being common, but by no means ubiquitous.


I believe your terminology is crossed a bit.  What you call a block is
usually called a sector, and what you call a page is known as a block.

Sector is (usually) the unit of reading from the NAND flash.

The unit of write in NAND flash is the page, typically 2k or 4k
depending on NAND generation, and thus consisting of 4-8 ATA sectors
(typically).  A single page may be written at a time.  I believe some
vendors support partial-page programming as well, allowing a single
sector append type operation where the previous write left off.

Ordered pages are collected into the unit of erase, which is known as
a block (or erase block), and is anywhere from 128KB to 512KB or
more, depending again on NAND generation, manufacturer, and a bunch of
other things.

Some large number of blocks are grouped by chip enables, often 4K or
8K blocks.


I think you're confusing erasing with writing.

When I say minimum write size, I mean that for an MLC, no matter 
how small you make a change, the minimum amount of data actually 
being written to the SSD is a full page (128k in my example).


Page is the unit of write, but it's much smaller in all NAND I am
aware of.

There is no append down at this level. If I have a page of 128k, with 
data in 5 of the 4k blocks, and I then want to add another 2k of data 
to this, I have to READ all 5 4k blocks into the controller's DRAM, 
add the 2k of data to that, then write out the full amount to a new 
page (if available), or wait for an older page to be erased before 
writing to it.  Thus, in this case,  in order to do an actual 2k 
write, the SSD must first read 10k of data, do some compositing, then 
write 12k to a fresh page.


Thus, to change any data inside a single page, then entire contents 
of that page have to be read, the page modified, then the entire page 
written back out.


See above.

What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs 
work differently, but still have problems with what I'll call 
excess-writing.


I think you're only describing dumb SSDs with erase-block granularity
mapping. Most (all) vendors have moved away from that technique since
random write performance is awful in those designs and they fall over
dead from wAmp in a jiffy.

SLC and MLC NAND is similar, and they are read/written/erased almost
identically by the controller.

I'm not sure that SSDs actually _have_ to erase - they just overwrite 
anything there with new data. But this is implementation dependent, 
so I can't say how /all/ MLC SSDs behave.


Technically you can program the same NAND page repeatedly, but since
bits can only transition from 1-0 on a program operation, the result
wouldn't be very meaningful.  An erase sets all the bits in the block
to 1, allowing you to store your data.

Once again, what I'm talking about is a characteristic of MLC SSDs, 
which are used in most consumer SSDS (the Intel X25-M, included).


Sure, such an SSD will commit any new writes to pages drawn from the 
list of never before used NAND.  However, at some point, this list 
becomes empty.  In most current MLC SSDs, there's about 10% extra 
(a 60GB advertised capacity is actually ~54GB usable with 6-8GB 
extra).   Once this list is empty, the SSD has to start writing 
back to previous used pages, which may require an erase step first 
before any write. Which is why MLC SSDs slow down drastically once 
they've been fulled to capacity several times.


From what I've seen, erasing a block typically takes a time in the
same scale as programming an MLC page, meaning in flash with large
page counts per block, the % of time spent erasing is not very large.

Lets say that an erase took 100ms and a program took 10ms, in an MLC
NAND device with 100 pages per block.  In this design, it takes us 1s
to program the entire block, but only 1/10 of the time to erase it.
An infinitely fast erase would only make the design about 10% faster.

For SLC the erase performance matters more since page writes are much
faster on average and there are half as many pages, but we were
talking MLC.

The performance differences seen are because the drives were artificially
fast to begin with, because they were empty.  It's similar to
destroking a rotating drive in many ways to speed seek times.  Once
the drive is full, it all comes down to raw NAND performance,

Re: [zfs-discuss] preview of new SSD based on SandForce controller

2010-01-05 Thread Erik Trimble

Wes Felter wrote:

Eric D. Mudama wrote:

On Mon, Jan  4 at 16:43, Wes Felter wrote:

Eric D. Mudama wrote:


I am not convinced that a general purpose CPU, running other software
in parallel, will be able to be timely and responsive enough to
maximize bandwidth in an SSD controller without specialized hardware
support.


Fusion-io would seem to be a counter-example, since it uses a fairly 
simple controller (I guess the controller still performs ECC and 
maybe XOR) and the driver eats a whole x86 core. The result is very 
high performance.


I see what you're saying, but it isn't obvious (to me) how well
they're using all the hardware at hand.  2GB/s of bandwidth over their
PCI-e link and what looks like a TON of NAND, with a nearly-dedicated
x86 core...  resuting in 600MB/s or something like that?


Actually it's 600-700MB/s out of a 1+1GB/s slot or 1.5GB/s with two 
cards in a 2+2GB/s slot. I suspect that's pretty close to the PCIe 
limit. IIRC they have 22 NAND channels at 40MB/s (theoretical peak) 
each, which is 880MB/s. I agree that their CPU efficiency is not 
great, but cores are supposed to be cheap these days.


Wes Felter



The single Fusion-IO card is a 4x PCI-E 1.1 interface, which means about 
1GB/s max throughput. The Fusion-IO Duo is a 8x PCI-E 2.0 interface, 
which tops out at about 4GB/s.   So, it looks like the single card is at 
least a major fraction of the max throughput of the interface, while the 
Duo card still has plenty of headroom. 

I see the single Fusion-IO card eat about 1/4 the CPU power that an 
8Gbit Fibre Channel HBA does, and roughly the same as a 10Gbit 
Ethernet card.  So, it's not out of line with comparable throughput 
add-in cards.  It does need significantly more CPU than a SAS or SCSI 
controller, though.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Eric D. Mudama

On Wed, Jan  6 at 14:56, Tristan Ball wrote:

For those searching list archives, the SNV125-S2/40GB given below is not
based on the Intel controller.

I queried Kingston directly about this because there appears to be so
much confusion (and I'm considering using these drives!), and I got back
that:

The V series uses a JMicron Controller
The V+ series uses a Samsung Controller
The M and E series are the Intel Drives


I'm 99.999% sure that the 40GB V drive is based on the Intel
architecture and that the kingston rep was wrong.

http://www.anandtech.com/storage/showdoc.aspx?i=3667&p=4

The label in the picture clearly shows a bare letter 'V' and 40GB
markings, and the board/layout is identical to the X25-M with only 5
NAND TSOPs instead of 10 or 20.

Either way though, the Intel-branded 40GB MLC drive (X25-V) is now
available on newegg as well:

http://www.newegg.com/Product/Product.aspx?Item=N82E16820167025

Currently $129.99 in retail packaging (with aluminum sled so it can
bolt into a 3.5 drive bay, etc.), no idea if they'll sell 'em for the
$85 that newegg briefly had the kingston branded version.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Thomas Burgess
I think the confusing part is that the 64gb version seems to use a different
controller altogether.
I couldn't find any SNV125-S2/40's in stock so i got 3 SNV125-S2/64's
thinking it would be the same, only bigger... looks like it was stupid on
my part.

now i understand why i got such a good deal.
well i have yet to try them... maybe they won't be so bad... on newegg they
get a lot of good ratings.
either way i doubt using them for the rpool will hurt me... just a little
more expensive than the compact flash cards i was going to get.


On Wed, Jan 6, 2010 at 12:57 AM, Eric D. Mudama
edmud...@bounceswoosh.orgwrote:

 On Wed, Jan  6 at 14:56, Tristan Ball wrote:

 For those searching list archives, the SNV125-S2/40GB given below is not
 based on the Intel controller.

 I queried Kingston directly about this because there appears to be so
 much confusion (and I'm considering using these drives!), and I got back
 that:

 The V series uses a JMicron Controller
 The V+ series uses a Samsung Controller
 The M and E series are the Intel Drives


 I'm 99.999% sure that the 40GB V drive is based on the Intel
 architecture and that the kingston rep was wrong.

 http://www.anandtech.com/storage/showdoc.aspx?i=3667p=4

 The label in the picture clearly shows a bare letter 'V' and 40GB
 markings, and the board/layout is identical to the X25-M with only 5
 NAND TSOPs instead of 10 or 20.

 Either way though, the Intel-branded 40GB MLC drive (X25-V) is now
 available on newegg as well:

 http://www.newegg.com/Product/Product.aspx?Item=N82E16820167025

 Currently $129.99 in retail packaging (with aluminum sled so it can
 bolt into a 3.5 drive bay, etc.), no idea if they'll sell 'em for the
 $85 that newegg briefly had the kingston branded version.


 --eric


 --
 Eric D. Mudama
 edmud...@mail.bounceswoosh.org

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS filesystem size mismatch

2010-01-05 Thread Nils K . Schøyen
A ZFS file system reports 1007GB being used (df -h / zfs list). When doing a 
'du -sh' on the filesystem root, I only get approx. 300GB, which is the correct 
size.

The file system became full during Christmas and I increased the quota from 1 
to 1.5 to 2TB and then decreased to 1.5TB. No reservations. The files and processes 
that filled up the file system have been removed/stopped.

Server: Sun Fire V240 with Solaris 10 10/08 Sparc. iSCSI-connected storage.

Size on other zfs file systems on this server are correctly reported.

See output below.

Any ideas??

iscsi-roskva# ls -la /ndc
total 57
drwxr-xr-x  11 root root  11 Jan  5 13:33 .
drwxr-xr-x  30 root root  38 Jan  6 07:17 ..
drwxr-xr-x   2 root other 12 Nov 29 22:59 TT_DB
drwxr-xr-x  66 1050 11000112 Jan  5 21:20 cssdata
drwxr-xr-x   4 206  26 4 Mar 24  2009 dbsave
drwxr-xr-x  33 1101711000 60 Oct  8 13:18 infomap
drwxr-xr-x   8 root other  8 Mar  6  2008 operations
drwxr-xr-x  48 1100011000 90 Jan  5 16:11 programs
drwxr-xr-x  12 root other 13 Aug 10 10:48 projects
drwxr-xr-x  21 1100111000 51 Sep 15 12:00 request
drwxr-xr-x  32 1100511000 45 Jan  5 11:02 stations

iscsi-roskva# du -sh /ndc/*
  24K   TT_DB
 6.5G   cssdata
 4.4G   dbsave
 535M   infomap
  71G   operations
  46G   programs
  79G   projects
 6.7G   request
  70G   stations

iscsi-roskva# df -h /ndc
Filesystem size   used  avail capacity  Mounted on
storagepool/ndc1.5T  1007G   529G66%/ndc

iscsi-roskva# zfs get all storagepool/ndc
NAME PROPERTY VALUE  SOURCE
storagepool/ndc  type filesystem -
storagepool/ndc  creation Thu Jul 30 15:14 2009  -
storagepool/ndc  used 1007G  -
storagepool/ndc  available529G   -
storagepool/ndc  referenced   1007G  -
storagepool/ndc  compressratio1.00x  -
storagepool/ndc  mounted  yes-
storagepool/ndc  quota1.50T  local
storagepool/ndc  reservation  none   default
storagepool/ndc  recordsize   128K   default
storagepool/ndc  mountpoint   /ndc   local
storagepool/ndc  sharenfs rw,root=hugin  local
storagepool/ndc  checksum on default
storagepool/ndc  compression  offdefault
storagepool/ndc  atimeoffinherited from 
storagepool
storagepool/ndc  devices  on default
storagepool/ndc  exec on default
storagepool/ndc  setuid   on default
storagepool/ndc  readonly offdefault
storagepool/ndc  zonedoffdefault
storagepool/ndc  snapdir  hidden default
storagepool/ndc  aclmode  groupmask  default
storagepool/ndc  aclinherit   restricted default
storagepool/ndc  canmount on default
storagepool/ndc  shareiscsi   offdefault
storagepool/ndc  xattron default
storagepool/ndc  copies   1  default
storagepool/ndc  version  3  -
storagepool/ndc  utf8only off-
storagepool/ndc  normalizationnone   -
storagepool/ndc  casesensitivity  sensitive  -
storagepool/ndc  vscanoffdefault
storagepool/ndc  nbmand   offdefault
storagepool/ndc  sharesmb offdefault
storagepool/ndc  refquota none   default
storagepool/ndc  refreservation   none   default
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss