Re: [zfs-discuss] Best practice for setting ACL

2010-09-13 Thread Craig Stevenson
So Cindy, Simon (or anyone else)... now that we are over a year past when Simon 
wrote his excellent blog introduction, is there an updated best practice for 
ACLs with CIFS?  Or is this blog entry still the best word on the street?

In my case, I am supporting multiple PCs (Workgroup) and Macs; running 
OpenSolaris B134.

Thanks,
Craig
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to migrate to 4KB sector drives?

2010-09-13 Thread Casper . Dik

On Sun, Sep 12, 2010 at 10:07 AM, Orvar Korvar
knatte_fnatte_tja...@yahoo.com wrote:
 No replies. Does this mean that you should avoid large drives with 4KB 
 sectors, that is, new drives? ZFS does not handle new drives?

Solaris 10u9 handles 4k sectors, so it might be in a post-b134 release of osol.



Build 118 adds support for 4K sectors with the following putback:

PSARC 2008/769 Multiple disk sector size support.
6710930 Solaris needs to support large sector size hard drive disk

But already in build 38 there is some support for large-sector disks in
ZFS.  6407365 large-sector disk support in ZFS


When new features are added, they are typically integrated into the next release 
first and then backported to the current release.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Edward Ned Harvey
 From: Richard Elling [mailto:rich...@nexenta.com]
 
 This operational definition of fragmentation comes from the single-
 user,
 single-tasking world (PeeCees). In that world, only one thread writes
 files
 from one application at one time. In those cases, there is a reasonable
 expectation that a single file's blocks might be contiguous on a single
 disk.
 That isn't the world we live in, where we have RAID, multi-user, or multi-
 threaded
 environments.

I don't know what you're saying, but I'm quite sure I disagree with it.

Regardless of multithreading, multiprocessing, it's absolutely possible to
have contiguous files, and/or file fragmentation.  That's not a
characteristic which depends on the threading model.

Also regardless of raid, it's possible to have contiguous or fragmented
files.  The same concept applies to multiple disks.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file recovery on lost RAIDZ array

2010-09-13 Thread Orvar Korvar
That sounds strange. What happened? You used raidz1?

You can roll your pool back to an earlier snapshot. Have you tried that? Or, you 
can import the pool at a state from within the last 30 seconds or so, I think.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hang on zpool import (dedup related)

2010-09-13 Thread Pawel Jakub Dawidek
On Sun, Sep 12, 2010 at 11:24:06AM -0700, Chris Murray wrote:
 Absolutely spot on George. The import with -N took seconds.
 
 Working on the assumption that esx_prod is the one with the problem, I bumped 
 that to the bottom of the list. Each mount was done in a second:
 
 # zfs mount zp
 # zfs mount zp/nfs
 # zfs mount zp/nfs/esx_dev
 # zfs mount zp/nfs/esx_hedgehog
 # zfs mount zp/nfs/esx_meerkat
 # zfs mount zp/nfs/esx_meerkat_dedup
 # zfs mount zp/nfs/esx_page
 # zfs mount zp/nfs/esx_skunk
 # zfs mount zp/nfs/esx_temp
 # zfs mount zp/nfs/esx_template
 
 And those directories have the content in them that I'd expect. Good!
 
 So now I try to mount esx_prod, and the influx of reads has started in   
 zpool iostat zp 1
 
 This is the filesystem with the issue, but what can I do now?

You could try to snapshot it (but keep it unmounted), then zfs send it
and zfs recv it to eg. zp/foo. Use -u option for zfs recv too, then try
to mount what you received.
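As a concrete sketch of that sequence (the snapshot name @rescue is a placeholder;
zp/foo is just the example target mentioned above):

    zfs snapshot zp/nfs/esx_prod@rescue
    zfs send zp/nfs/esx_prod@rescue | zfs receive -u zp/foo
    zfs mount zp/foo      # only attempt the mount once the receive has completed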

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!


pgpQxyW0TDNO3.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Orvar Korvar
I was thinking of deleting all zfs snapshots before doing a zfs send/receive to 
another new zpool. Then everything would be defragmented, I thought.


(I assume snapshots work this way: I snapshot once and do some changes, say 
delete file A and edit file B. When I delete the snapshot, file A is 
still deleted and file B is still edited. In other words, deleting a 
snapshot does not revert the changes. Therefore I just delete all 
snapshots and make my filesystem up to date before zfs send/receive.)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Orvar Korvar
 
 I was thinking to delete all zfs snapshots before zfs send receive to
 another new zpool. Then everything would be defragmented, I thought.

You don't need to delete snaps before zfs send, if your goal is to
defragment your filesystem.  Just perform a single zfs send, and don't do
any incrementals afterward.  The receiving side will lay out the
filesystem as it wishes.


 (I assume snapshots works this way: I snapshot once and do some
 changes, say delete file A and edit file B. When I delete the
 snapshot, the file A is still deleted and file B is still edited.
 In other words, deletion of snapshot does not revert back the changes.

You are correct.

A snapshot is a read-only image of the filesystem, as it was, at some time
in the past.  If you destroy the snapshot, you've only destroyed the
snapshot.  You haven't destroyed the most recent live version of the
filesystem.

If you wanted to, you could rollback, which destroys the live version of
the filesystem, and restores you back to some snapshot.  But that is a very
different operation.  Rollback is not at all similar to destroying a
snapshot.  These two operations are basically opposites of each other.
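A minimal illustration with placeholder names (tank/fs and @old are made up):

    zfs destroy tank/fs@old     # removes only the snapshot; the live filesystem is untouched
    zfs rollback tank/fs@old    # discards every change made after @old and restores that state

(zfs rollback can only go back to the most recent snapshot unless -r is given to
destroy the intervening ones.)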

All of this is discussed in the man pages.  I suggest man zpool and man
zfs

Everything you need to know is written there.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Richard Elling
On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote:

 From: Richard Elling [mailto:rich...@nexenta.com]
 
 This operational definition of fragmentation comes from the single-
 user,
 single-tasking world (PeeCees). In that world, only one thread writes
 files
 from one application at one time. In those cases, there is a reasonable
 expectation that a single file's blocks might be contiguous on a single
 disk.
 That isn't the world we live in, where we have RAID, multi-user, or multi-
 threaded
 environments.
 
 I don't know what you're saying, but I'm quite sure I disagree with it.
 
 Regardless of multithreading, multiprocessing, it's absolutely possible to
 have contiguous files, and/or file fragmentation.  That's not a
 characteristic which depends on the threading model.

Possible, yes.  Probable, no.  Consider that a file system is allocating
space for multiple, concurrent file writers.

 Also regardless of raid, it's possible to have contiguous or fragmented
 files.  The same concept applies to multiple disks.

RAID works against the efforts to gain performance by contiguous access
because the access becomes non-contiguous.
 -- richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Edward Ned Harvey
 From: Richard Elling [mailto:rich...@nexenta.com]
 
  Regardless of multithreading, multiprocessing, it's absolutely
 possible to
  have contiguous files, and/or file fragmentation.  That's not a
  characteristic which depends on the threading model.
 
 Possible, yes.  Probable, no.  Consider that a file system is
 allocating
 space for multiple, concurrent file writers.

Process A is writing.  Suppose it starts writing at block 10,000 out of my
1,000,000 block device.
Process B is also writing.  Suppose it starts writing at block 50,000.

These two processes write simultaneously, and no fragmentation occurs,
unless Process A writes more than 40,000 blocks.  In that case, A's file
gets fragmented, and the 2nd fragment might begin at block 300,000.

The thing that causes fragmentation (not counting COW) is the size of the
span of unallocated blocks.  Most filesystems will allocate blocks from the
largest unallocated contiguous area of the physical device, so as to
minimize fragmentation.

I can't say how ZFS behaves authoritatively, but I'd be extremely surprised
if two processes writing different files as fast as possible result in all
their blocks interleaved with each other on physical disk.  I think this is
possible if you have multiple processes lazily writing at less-than full
speed, because then ZFS might remap a bunch of small writes into a single
contiguous write.


  Also regardless of raid, it's possible to have contiguous or
 fragmented
  files.  The same concept applies to multiple disks.
 
 RAID works against the efforts to gain performance by contiguous access
 because the access becomes non-contiguous.

These might as well have been words randomly selected from the dictionary to
me - I recognize that it's a complete sentence, but you might have said
processors aren't needed in computers anymore, or something equally
illogical.

Suppose you have a 3-disk raid stripe set, using traditional simple
striping, because it's very easy to explain.  Suppose a process is writing
as fast as it can, and suppose it's going to write block 0 through block 99
of a virtual device.

virtual block 0 = block 0 of disk 0
virtual block 1 = block 0 of disk 1
virtual block 2 = block 0 of disk 2
virtual block 3 = block 1 of disk 0
virtual block 4 = block 1 of disk 1
virtual block 5 = block 1 of disk 2
virtual block 6 = block 2 of disk 0
virtual block 7 = block 2 of disk 1
virtual block 8 = block 2 of disk 2
virtual block 9 = block 3 of disk 0
...
virtual block 96 = block 32 of disk 0
virtual block 97 = block 32 of disk 1
virtual block 98 = block 32 of disk 2
virtual block 99 = block 33 of disk 0
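The mapping above is just integer division and remainder; a quick shell sketch
(3-disk simple stripe, as in the example) reproduces it:

    disks=3
    for v in 0 1 2 3 4 5 96 97 98 99; do
        echo "virtual block $v = block $((v / disks)) of disk $((v % disks))"
    done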

Thanks to buffering and command queueing, the OS tells the RAID controller
to write blocks 0-8, and the raid controller tells disk 0 to write blocks
0-2, tells disk 1 to write blocks 0-2, and tells disk 2 to write 0-2,
simultaneously.  So the total throughput is the sum of all 3 disks writing
continuously and contiguously to sequential blocks.

This accelerates performance for continuous sequential writes.  It does not
work against efforts to gain performance by contiguous access.

The same concept is true for raid-5 or raidz, but it's more complicated.
The filesystem or raid controller does in fact know how to write sequential
filesystem blocks to sequential physical blocks on the physical devices for
the sake of performance enhancement on contiguous read/write.

If you don't believe me, there's a very easy test to prove it:

Create a zpool with 1 disk in it.  time writing 100G (or some amount of data
 larger than RAM.)
Create a zpool with several disks in a raidz set, and time writing 100G.
The speed scales up linearly with the number of disks, until you reach some
other hardware bottleneck, such as bus speed or something like that.
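A rough sketch of that test (device names and sizes are placeholders; /dev/zero is
only an approximation of real data, so leave compression off for the test):

    zpool create t1 c0t1d0
    time dd if=/dev/zero of=/t1/bigfile bs=1024k count=102400     # ~100 GB
    zpool destroy t1

    zpool create tz raidz c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0
    time dd if=/dev/zero of=/tz/bigfile bs=1024k count=102400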

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS archive image

2010-09-13 Thread Buck Huffman
I have a flash archive that is stored in a ZFS snapshot stream.  Is there a way 
to mount this image so I can read files from it?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS archive image

2010-09-13 Thread Lori Alt

On 09/13/10 09:40 AM, Buck Huffman wrote:

I have a flash archive that is stored in a ZFS snapshot stream.  Is there a way 
to mount this image so I can read files from it?
   
No, but you can use the flar split command to split the flash archive 
into its constituent parts, one of which will be a zfs send stream that 
you can unpack with zfs recv.
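A hedged sketch of what that can look like (the archive name and target dataset are
placeholders, and the exact section file names depend on your build; see flar(1M)):

    flar split my_system.flar                     # writes one file per archive section
    zfs receive -u tank/flar_restore < archive    # the 'archive' section should be the zfs send stream
    zfs mount tank/flar_restore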


Lori

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intermittent ZFS hang

2010-09-13 Thread Charles J. Knipe
 Charles,
 
 Just like UNIX, there are several ways to drill down on the problem.  I
 would probably start with a live crash dump (savecore -L) when you see
 the problem.  Another method would be to grab multiple stats commands
 during the problem to see where you can drill down later.  I would
 probably use this method if the problem lasts for a while and drill
 down with dtrace based on what I saw.  But each method is going to
 depend on your skill when looking at the problem.
 
 Dave

Dave,

After running clean since my last post, the problem occurred again today.  This 
time I was able to gather some data while it was going on.  The only thing that 
jumps out at me so far is the output of echo ::zio_state | mdb -k.

Under normal operations this usually looks like this:

ADDRESS          TYPE  STAGE            WAITER

ff090eb69328 NULL  OPEN             -
ff090eb69c88 NULL  OPEN             -

Here are a couple of samples while the issue was happening:

ADDRESS          TYPE  STAGE            WAITER

ff0bfe8c59b0 NULL  CHECKSUM_VERIFY  ff003e2f2c60
ff090eb69328 NULL  OPEN             -
ff090eb69c88 NULL  OPEN             -

ADDRESS          TYPE  STAGE            WAITER

ff09bb12a040 NULL  CHECKSUM_VERIFY  ff003d6acc60
ff0bfe8c59b0 NULL  CHECKSUM_VERIFY  ff003e2f2c60
ff090eb69328 NULL  OPEN             -
ff090eb69c88 NULL  OPEN             -

Operating under the assumption that the waiter column is referencing kernel 
threads, I went looking for those addresses in the thread list.  Here are the 
threadlist entries for ff003d6acc60 and ff003e2f2c60 from the example 
directly above, taken at about the same time as that output:

ff003d6acc60 ff0930d8c700 ff09172f9de0   2   0 ff09bb12a348
  PC: _resume_from_idle+0xf1    CMD: zpool-pool0
  stack pointer for thread ff003d6acc60: ff003d6ac360
  [ ff003d6ac360 _resume_from_idle+0xf1() ]
swtch+0x145()
cv_wait+0x61()
zio_wait+0x5d()
dbuf_read+0x1e8()
dmu_buf_hold+0x93()
zap_get_leaf_byblk+0x56()
zap_deref_leaf+0x78()
fzap_length+0x42()
zap_length_uint64+0x84()
ddt_zap_lookup+0x4b()
ddt_object_lookup+0x6d()
ddt_lookup+0x115()
zio_ddt_free+0x42()
zio_execute+0x8d()
taskq_thread+0x248()
thread_start+8()

ff003e2f2c60 fbc2dbb00   0  60 ff0bfe8c5cb8
  PC: _resume_from_idle+0xf1    THREAD: txg_sync_thread()
  stack pointer for thread ff003e2f2c60: ff003e2f2a40
  [ ff003e2f2a40 _resume_from_idle+0xf1() ]
swtch+0x145()
cv_wait+0x61()
zio_wait+0x5d()
spa_sync+0x40c()
txg_sync_thread+0x24a()
thread_start+8()

Not sure if any of that sheds any light on the problem.  I also have a live 
dump from the period when the problem was happening, a bunch of iostats, 
mpstats, and ::arc, ::spa, ::zio_state, and ::threadlist -v from mdb -k at 
several points during the issue.
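For anyone wanting to capture the same data during a future occurrence, here is a
hedged sketch of the collection commands mentioned above (output paths and
sampling intervals are arbitrary):

    savecore -L                                        # live crash dump, no panic required
    echo "::zio_state" | mdb -k > /var/tmp/zio_state.out
    echo "::threadlist -v" | mdb -k > /var/tmp/threads.out
    iostat -xn 5 12 > /var/tmp/iostat.out
    mpstat 5 12 > /var/tmp/mpstat.out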

If you have any advice on how to proceed from here in debugging this issue I'd 
greatly appreciate it.  So you know, I'm generally very comfortable with unix, 
but dtrace and the solaris kernel are unfamiliar territory.

In any event, thanks again for all the help thus far.

-Charles
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread David Dyer-Bennet

On Mon, September 13, 2010 07:14, Edward Ned Harvey wrote:
 From: Richard Elling [mailto:rich...@nexenta.com]

 This operational definition of fragmentation comes from the single-
 user,
 single-tasking world (PeeCees). In that world, only one thread writes
 files
 from one application at one time. In those cases, there is a reasonable
 expectation that a single file's blocks might be contiguous on a single
 disk.
 That isn't the world we live in, where we have RAID, multi-user, or multi-
 threaded
 environments.

 I don't know what you're saying, but I'm quite sure I disagree with it.

 Regardless of multithreading, multiprocessing, it's absolutely possible to
 have contiguous files, and/or file fragmentation.  That's not a
 characteristic which depends on the threading model.

 Also regardless of raid, it's possible to have contiguous or fragmented
 files.  The same concept applies to multiple disks.

The attitude that it *matters* seems to me to have developed, and be
relevant only to, single-user computers.

Regardless of whether a file is contiguous or not, by the time you read
the next chunk of it, in the multi-user world some other user is going to
have moved the access arm of that drive.  Hence, it doesn't matter if the
file is contiguous or not.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Proper procedure when device names have changed

2010-09-13 Thread Brian
I am running zfs-fuse on an Ubuntu 10.04 box.  I have a dual mirrored pool:
mirror sdd sde mirror sdf sdg

Recently the device names shifted on my box and the devices are now sdc sdd sde 
and sdf.  The pool is of course very unhappy because the mirrors are no longer 
matched up and one device is missing.  What is the proper procedure to deal 
with this?

-brian
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proper procedure when device names have changed

2010-09-13 Thread LaoTsao 老曹


Try exporting and re-importing the zpool.
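A hedged sketch of that for zfs-fuse on Ubuntu (pool name assumed; importing via the
persistent /dev/disk/by-id links also helps the pool survive future device renames):

    zpool export tank
    zpool import -d /dev/disk/by-id tank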

On 9/13/2010 1:26 PM, Brian wrote:

I am running zfs-fuse on an Ubuntu 10.04 box.  I have a dual mirrored pool:
mirror sdd sde mirror sdf sdg

Recently the device names shifted on my box and the devices are now sdc sdd sde and sdf.  
The pool is of course very unhappy because the mirrors are no longer matched up and one 
device is missing.  What is the proper procedure to deal with this?

-brian
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Orvar Korvar
To summarize, 

A) resilver does not defrag.

B) zfs send receive to a new zpool means it will be defragged

Correctly understood?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proper procedure when device names have changed

2010-09-13 Thread Brian
That seems to have done the trick.  I was worried because in the past I've had 
problems importing faulted file systems.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intermittent ZFS hang

2010-09-13 Thread Roy Sigurd Karlsbakk
 At first we blamed de-dupe, but we've disabled that. Next we suspected
 the SSD log disks, but we've seen the problem with those removed, as
 well.

Did you have dedup enabled and then disabled it? If so, data can (or will) be 
deduplicated on the drives. Currently the only way of de-deduping them is to 
recopy them after disabling dedup.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is 
an elementary imperative for all pedagogues to avoid excessive use of idioms of 
foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Richard Elling
On Sep 13, 2010, at 10:54 AM, Orvar Korvar wrote:

 To summarize, 
 
 A) resilver does not defrag.
 
 B) zfs send receive to a new zpool means it will be defragged

Define fragmentation?

If you follow the wikipedia definition of defragmentation then the 
answer is no, zfs send/receive does not change the location of files.
Why? Because zfs sends objects, not files.  The objects can be 
allocated in a (more) contiguous form on the receiving side, or maybe
not, depending on the configuration and use of the receiving side. 

A file may be wholly contained in an object, or not, depending on how it
was created. For example, if a file is less than 128KB (by default) and
is created at one time, then it will be wholly contained in one object.
By contrast, UFS, which has an 8KB max block size, will use up to 16 different
blocks to store the same file. These blocks may or may not be contiguous
in UFS.

http://en.wikipedia.org/wiki/Defragmentation

 Correctly understood?

Clear as mud.  I suggest deprecating the use of the term defragmentation.
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com

Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [mdb-discuss] mdb -k - I/O usage

2010-09-13 Thread Piotr Jasiukajtis
This is snv_128 x86.

 ::arc
hits  =  39811943
misses=630634
demand_data_hits  =  29398113
demand_data_misses=490754
demand_metadata_hits  =  10413660
demand_metadata_misses=133461
prefetch_data_hits= 0
prefetch_data_misses  = 0
prefetch_metadata_hits=   170
prefetch_metadata_misses  =  6419
mru_hits  =   2933011
mru_ghost_hits= 43202
mfu_hits  =  36878818
mfu_ghost_hits= 45361
deleted   =   1299527
recycle_miss  = 46526
mutex_miss=   355
evict_skip= 25539
evict_l2_cached   = 0
evict_l2_eligible = 77011188736
evict_l2_ineligible   =  76253184
hash_elements =278135
hash_elements_max =279843
hash_collisions   =   1653518
hash_chains   = 75135
hash_chain_max= 9
p =  4787 MB
c =  5722 MB
c_min =   715 MB
c_max =  5722 MB
size  =  5428 MB
hdr_size  =  56535840
data_size = 5158287360
other_size= 477726560
l2_hits   = 0
l2_misses = 0
l2_feeds  = 0
l2_rw_clash   = 0
l2_read_bytes = 0
l2_write_bytes= 0
l2_writes_sent= 0
l2_writes_done= 0
l2_writes_error   = 0
l2_writes_hdr_miss= 0
l2_evict_lock_retry   = 0
l2_evict_reading  = 0
l2_free_on_write  = 0
l2_abort_lowmem   = 0
l2_cksum_bad  = 0
l2_io_error   = 0
l2_size   = 0
l2_hdr_size   = 0
memory_throttle_count = 0
arc_no_grow   = 0
arc_tempreserve   = 0 MB
arc_meta_used =  1288 MB
arc_meta_limit=  1430 MB
arc_meta_max  =  1288 MB

 ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     789865              3085   19%
ZFS File Data             1406055              5492   34%
Anon                       396297              1548    9%
Exec and libs                7178                28    0%
Page cache                   8428                32    0%
Free (cachelist)           117928               460    3%
Free (freelist)           1464224              5719   35%

Total                     4189975             16367
Physical                  4189974             16367


 ::spa -ev
ADDR STATE NAME
ff04f0eb4500ACTIVE data

ADDR STATE AUX  DESCRIPTION
ff04f2f52940 HEALTHY   -root

                   READ    WRITE     FREE    CLAIM    IOCTL
OPS                   0        0        0        0        0
BYTES                 0        0        0        0        0
EREAD 0
EWRITE0
ECKSUM0

ff050a2fd980 HEALTHY   -  raidz

                   READ    WRITE     FREE    CLAIM    IOCTL
OPS 0x57090 0x37436a000
BYTES   0x8207f3c00  0x22345d0800000
EREAD 0
EWRITE0
ECKSUM0

ff050a2fa0c0 HEALTHY   -/dev/dsk/c7t2d0s0

                   READ    WRITE     FREE    CLAIM    IOCTL
OPS 0x4416e 0x10564000  0x74326
BYTES   0x10909da00  0x45089d600000
EREAD 0
EWRITE0
ECKSUM0

ff050a2fa700 HEALTHY   -/dev/dsk/c7t3d0s0

                   READ    WRITE     FREE    CLAIM    IOCTL
OPS 0x43fca 0x1055fa00  0x74326
BYTES   0x108e14400  0x45087a400000
EREAD 0
EWRITE0
ECKSUM0

ff050a2fad40 HEALTHY   -/dev/dsk/c7t4d0s0

                   READ    WRITE     FREE    CLAIM    IOCTL
OPS 0x44221 0x10553300  0x74326
BYTES   0x108a56c00  0x4508c8a00000
EREAD 0
EWRITE0
ECKSUM0

ff050a2fb380 HEALTHY   - 

Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-13 Thread Hatish Narotam
Makes sense. My understanding is not good enough to confidently make my own
decisions, and I'm learning as I go. The BPG says:

   - The recommended number of disks per group is between 3 and 9. If you
   have more disks, use multiple groups

If there was a reason leading up to this statement, I didn't follow it.

However, a few paragraphs later, their RaidZ2 example says [4x(9+2), 2 hot
spares, 18.0 TB]. So I guess 8+2 should be quite acceptable, especially
since performance is the lowest priority.



On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey sh...@nedharvey.comwrote:

  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of hatish
 
  I have just
  read the Best Practices guide, and it says your group shouldn't have > 9
  disks.

 I think the value you can take from this is:
 Why does the BPG say that?  What is the reasoning behind it?

 Anything that is a rule of thumb either has reasoning behind it (you
 should know the reasoning) or it doesn't (you should ignore the rule of
 thumb, dismiss it as myth.)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [mdb-discuss] onnv_142 - vfs_mountroot: cannot mount root

2010-09-13 Thread Gavin Maltby

On 09/07/10 23:26, Piotr Jasiukajtis wrote:

Hi,

After upgrade from snv_138 to snv_142 or snv_145 I'm unable to boot the system.
Here is what I get.

Any idea why it's not able to import rpool?

I saw this issue also on older builds on a different machines.


This sounds (based on the presence of cpqary) not unlike:

6972328 Installation of snv_139+ on HP BL685c G5 fails due to panic during auto 
install process

which was introduced into onnv_139 by the fix for this

6927876 For 4k sector support, ZFS needs to use DKIOCGMEDIAINFOEXT

The fix is in onnv_148 after the external push switch-off, fixed via

6967658 sd_send_scsi_READ_CAPACITY_16() needs to handle SBC-2 and SBC-3 
response formats

I experienced this on data pools rather than the rpool, but I suspect on the 
rpool
you'd get the vfs_mountroot panic you see when rpool import fails.  My 
workaround
was to compile a zfs with the fix for 6927876 changed to force the default
physical block size of 512 and drop that into the BE before booting to it.
There was no simpler workaround available.

Gavin
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-13 Thread Hatish Narotam
Mattias, what you say makes a lot of sense. When I saw *Both of the above
situations resilver in equal time*, I was like no way! But like you said,
assuming no bus bottlenecks.

This is my exact breakdown (cheap disks on cheap bus :P) :

PCI-E 8X 4-port ESata Raid Controller.
4 x ESata to 5Sata Port multipliers (each connected to a ESata port on the
controller).
20 x Samsung 1TB HDD's. (each connected to a Port Multiplier).

The PCIE 8x port gives me 4GBps, which is 32Gbps. No problem there. Each
ESata port guarantees 3Gbps, therefore 12Gbps limit on the controller. Each
PM can give up to 3Gbps, which is shared amongst 5 drives. According to
Samsung's site, max read speed is 250MBps, which translates to 2Gbps.
Multiply by 5 drives gives you 10Gbps. Which is 333% of the PM's capability.
So the drives aren't likely to hit max read speed for long lengths of time,
especially during rebuild time.

So the bus is going to be quite a bottleneck. Let's assume that the drives
are 80% full. That's 800GB that needs to be read on each drive, which is
(800x9) 7.2TB.
Best case scenario, we can read 7.2TB at 3Gbps
= 57.6 Tb at 3Gbps
= 57600 Gb at 3Gbps
= 19200 seconds
= 320 minutes
= 5 Hours 20 minutes.
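The same arithmetic as a quick one-liner, for anyone who wants to vary the
assumptions (800GB per surviving drive, 9 drives to read, 3Gbps of usable bus
bandwidth):

    echo 'scale=1; 800*9*8/3/3600' | bc     # hours; prints 5.3

Halving the assumed bandwidth roughly doubles the estimate.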

Even if it takes twice that amount of time, I'm happy.

Initially I had been thinking 2 PM's for each vdev. But now I'm thinking
maybe split it as wide as I can ([2 disks per PM] x 2, [3 disks per PM] x 2)
for each vdev. It'll give the best possible speed, but still won't max out
the HDD's.

I've never actually sat and done the math before. Hope it's decently accurate
:)

On Wed, Sep 8, 2010 at 3:27 PM, Edward Ned Harvey sh...@nedharvey.comwrote:

  From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of
  Mattias Pantzare
 
  It
  is about 1 vdev with 12 disk or  2 vdev with 6 disks. If you have 2
  vdev you have to read half the data compared to 1 vdev to resilver a
  disk.

 Let's suppose you have 1T of data.  You have 12-disk raidz2.  So you have
 approx 100G on each disk, and you replace one disk.  Then 11 disks will
 each
 read 100G, and the new disk will write 100G.

 Let's suppose you have 1T of data.  You have 2 vdev's that are each 6-disk
 raidz1.  Then we'll estimate 500G is on each vdev, so each disk has approx
 100G.  You replace a disk.  Then 5 disks will each read 100G, and 1 disk
 will write 100G.

 Both of the above situations resilver in equal time, unless there is a bus
 bottleneck.  21 disks in a single raidz3 will resilver just as fast as 7
 disks in a raidz1, as long as you are avoiding the bus bottleneck.  But 21
 disks in a single raidz3 provides better redundancy than 3 vdev's each
 containing a 7 disk raidz1.

 In my personal experience, approx 5 disks can max out approx 1 bus.  (It
 actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks
 on a good bus, or good disks on a crap bus, but generally speaking people
 don't do that.  Generally people get a good bus for good disks, and cheap
 disks for crap bus, so approx 5 disks max out approx 1 bus.)

 In my personal experience, servers are generally built with a separate bus
 for approx every 5-7 disk slots.  So what it really comes down to is ...

 Instead of the Best Practices Guide saying Don't put more than ___ disks
 into a single vdev, the BPG should say Avoid the bus bandwidth bottleneck
 by constructing your vdev's using physical disks which are distributed
 across multiple buses, as necessary per the speed of your disks and buses.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-13 Thread Hatish Narotam
Ah, I see. But I think your math is a bit out:

62.5e6 IOPS @ 100 IOPS
= 625000 seconds
= 10416m
= 173h
= 7 days 6 hours.

So 7 days 6 hours. That's long, but I can live with it. This isn't for an
enterprise environment. While the length of time is a worry in terms of
increasing the chance that another drive will fail, in my mind that is mitigated
by the fact that the drives won't be under major stress during that time. It's
a workable solution.

On Thu, Sep 9, 2010 at 3:03 PM, Erik Trimble erik.trim...@oracle.comwrote:

  On 9/9/2010 5:49 AM, hatish wrote:

 Very interesting...

 Well, lets see if we can do the numbers for my setup.

  From a previous post of mine:

 [i]This is my exact breakdown (cheap disks on cheap bus :P) :


 PCI-E 8X 4-port ESata Raid Controller.
 4 x ESata to 5Sata Port multipliers (each connected to a ESata port on the
 controller).
 20 x Samsung 1TB HDD's. (each connected to a Port Multiplier).

 The PCIE 8x port gives me 4GBps, which is 32Gbps. No problem there. Each
 ESata port guarantees 3Gbps, therefore 12Gbps limit on the controller. Each
 PM can give up to 3Gbps, which is shared amongst 5 drives. According to
 Samsungs site, max read speed is 250MBps, which translates to 2Gbps.
 Multiply by 5 drives gives you 10Gbps. Which is 333% of the PM's capability.
 So the drives arent likely to hit max read speed for long lengths of time,
 especially during rebuild time.

 So the bus is going to be quite a bottleneck. Lets assume that the drives
 are 80% full. Thats 800GB that needs to be read on each drive, which is
 (800x9) 7.2TB.
 Best case scenario, we can read 7.2TB at 3Gbps
 = 57.6 Tb at 3Gbps
 = 57600 Gb at 3Gbps
 = 19200 seconds
 = 320 minutes
 = 5 Hours 20 minutes.

 Even if it takes twice that amount of time, Im happy.

 Initially I had been thinking 2 PM's for each vdev. But now Im thinking
 maybe split it wide as best I can ([2 disks per PM] x 2, [3 disks per
 PM] x 2) for each vdev. It'll give the best possible speed, but still wont
 max out the HDD's.

 I've never actually sat and done the math before. Hope its decently
 accurate :)[/i]

 My scenario, as from Erik's post:
 Scenario: I have 10 1TB disks in a raidz2, and I have 128k
 slab sizes. Thus, I have 16k of data for each slab written to each
 disk. (8x16k data + 32k parity for a 128k slab size). So, each IOPS
 gets to reconstruct 16k of data on the failed drive. It thus takes
 about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.

 Lets assume the drives are at 95% capacity, which is a pretty bad
 scenario. So thats 7600GB, which is 60800Gb. There will be no other IO while
 a rebuild is going.
 Best Case: I'll read at 12Gbps,  write at 3Gbps (4:1). I read 128K for
 every 16K I write (8:1). Hence the read bandwidth will be the bottleneck. So
 60800Gb @ 12Gbps is 5066s which is 84m27s (Never gonna happen). A more
 realistic read of 1.5Gbps gives me 40533s, which is 675m33s, which is
 11h15m33s. Which is a more realistic time to read 7.6TB.



 Actually, your biggest bottleneck will be the IOPS limits of the drives.  A
 7200RPM SATA drive tops out at 100 IOPS.  Yup. That's it.

 So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100
 IOPS, that means you will finish (best case) in 62.5e4 seconds.  Which is
 over 173 hours. Or, about 7.25 WEEKS.


 --
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA
 Timezone: US/Pacific (GMT-0800)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done

2010-09-13 Thread Craig Cory
Run away! Run fast little Netapp. Don't anger the sleeping giant - Oracle!



David Magda wrote:
 Seems that things have been cleared up:

 NetApp (NASDAQ: NTAP) today announced that both parties have agreed to
 dismiss their pending patent litigation, which began in 2007 between Sun
 Microsystems and NetApp. Oracle and NetApp seek to have the lawsuits
 dismissed without prejudice. The terms of the agreement are confidential.

 http://tinyurl.com/39qkzgz
 http://www.netapp.com/us/company/news/news-rel-20100909-oracle-settlement.html

 A recap of the history at:

 http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



-- 
Craig Cory
 Senior Instructor :: ExitCertified
 : Sun Certified System Administrator
 : Sun Certified Network Administrator
 : Sun Certified Security Administrator
 : Veritas Certified Instructor

 8950 Cal Center Drive
 Bldg 1, Suite 110
 Sacramento, California  95826
 [e] craig.c...@exitcertified.com
 [p] 916.669.3970
 [f] 916.669.3977

+-+
 ExitCertified :: Excellence in IT Certified Education

  Certified training with Oracle, Sun Microsystems, Apple, Symantec, IBM,
   Red Hat, MySQL, Hitachi Storage, SpringSource and VMWare.

 1.800.803.EXIT (3948)  |  www.ExitCertified.com
+-+
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-13 Thread Hatish Narotam
Hi,

*The PCIE 8x port gives me 4GBps, which is 32Gbps. No problem there. Each
ESata port guarantees 3Gbps, therefore 12Gbps limit on the controller.*

I was simply listing the bandwidth available at the different stages of the
data cycle. The PCIE port gives me 32Gbps. The Sata card gives me a possible
12Gbps. I'd rather be cautious and assume I'll get more like 6Gbps; it is a
cheap card after all.

*I guarantee you this is not a sustainable speed for 7.2krpm sata disks.* (I
am well aware :) )

* Which is 333% of the PM's capability. *

Assuming that it is, 5 drives at that speed will max out my PM 3 times over.
So my PM will automatically throttle the drives' speed to a third of that, on
account of the PM being maxed out.

Thanks for the rough IO speed check :)


On Thu, Sep 9, 2010 at 3:20 PM, Edward Ned Harvey sh...@nedharvey.comwrote:

  From: Hatish Narotam [mailto:hat...@gmail.com]
 
  PCI-E 8X 4-port ESata Raid Controller.
  4 x ESata to 5Sata Port multipliers (each connected to a ESata port on
  the controller).
  20 x Samsung 1TB HDD's. (each connected to a Port Multiplier).

 Assuming your disks can all sustain 500Mbit/sec, which I find to be typical
 for 7200rpm sata disks, and you have groups of 5 that all have a 3Gbit
 upstream bottleneck, it means each of your groups of 5 should be fine in a
 raidz1 configuration.

 You think that your sata card can do 32Gbit because it's on a PCIe x8 bus.
 I highly doubt it unless you paid a grand or two for your sata controller,
 but please prove me wrong.  ;-)  I think the backplane of the sata
 controller is more likely either 3G or 6G.

 If it's 3G, then you should use 4 groups of raidz1.
 If it's 6G, then you can use 2 groups of raidz2 (because 10 drives of
 500Mbit can only sustain 5Gbit)
 If it's 12G or higher, then you can make all of your drives one big vdev of
 raidz3.


  According to Samsungs site, max read speed is 250MBps, which
  translates to 2Gbps. Multiply by 5 drives gives you 10Gbps.

 I guarantee you this is not a sustainable speed for 7.2krpm sata disks.
  You
 can get a decent measure of sustainable speed by doing something like:
(write 1G byte)
time dd if=/dev/zero of=/some/file bs=1024k count=1024
(beware: you might get an inaccurate speed measurement here
due to ram buffering.  See below.)

(reboot to ensure nothing is in cache)
(read 1G byte)
time dd if=/some/file of=/dev/null bs=1024k
(Now you're certain you have a good measurement.
If it matches the measurement you had before,
that means your original measurement was also
accurate.  ;-) )


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool upgrade and zfs upgrade behavior on b145

2010-09-13 Thread Chris Mosetick
Not sure what the best list to send this to is right now, so I have selected
a few, apologies in advance.

A couple questions.  First I have a physical host (call him bob) that was
just installed with b134 a few days ago.  I upgraded to b145 using the
instructions on the Illumos wiki yesterday.  The pool has been upgraded (27)
and the zfs file systems have been upgraded (5).

ch...@bob:~# zpool upgrade rpool
This system is currently running ZFS pool version 27.
Pool 'rpool' is already formatted using the current version.

ch...@bob:~# zfs upgrade rpool
7 file systems upgraded

The file systems have been upgraded according to zfs get version rpool

Looks ok to me.

However, I now get an error when I run zdb -D.  I can't remember exactly
when I turned dedup on, but I moved some data on rpool, and zpool list
shows 1.74x ratio.

ch...@bob:~# zdb -D rpool
zdb: can't open 'rpool': No such file or directory

Also, running zdb by itself returns the expected output, but it still says my
rpool is version 22.  Is that expected?

I never ran zdb before the upgrade, since it was a clean install from the
b134 iso to go straight to b145.  One thing I will mention is that the
hostname of the machine was changed too (using these instructions:
http://wiki.genunix.org/wiki/index.php/Change_hostname_HOWTO).
bob used to be eric.  I don't know if that matters, but I can't open up the
Users and Groups tool from Gnome anymore, *unable to su*, so something is
still not right there.

Moving on, I have another fresh install of b134 from iso inside a virtualbox
virtual machine, on a total different physical machine.  This machine is
named weston and was upgraded to b145 using the same Illumos wiki
instructions.  His name has never changed.  When I run the same zdb -D
command I get the expected output.

ch...@weston:~# zdb -D rpool
DDT-sha256-zap-unique: 11 entries, size 558 on disk, 744 in core
dedup = 1.00, compress = 7.51, copies = 1.00, dedup * compress / copies =
7.51

However, after zpool and zfs upgrades *on both machines*, they still say the
rpool is version 22.  Is that expected/correct?  I added a new virtual disk
to the vm weston to see what would happen if I made a new pool on the new
disk.

ch...@weston:~# zpool create test c5t1d0

Well, the new test pool shows version 27, but rpool is still listed at 22
by zdb.  Is this expected/correct behavior?  See the output below to see
the rpool and test pool version numbers according to zdb on the host weston.


Can anyone provide any insight into what I'm seeing?  Do I need to delete my
b134 boot environments for rpool to show as version 27 in zdb?  Why does zdb
-D rpool give me can't open on the host bob?
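For cross-checking, a hedged sketch of what the tools themselves report (standard
commands; output will of course vary):

    zpool get version rpool       # pool version according to zpool
    zfs get -r version rpool      # dataset versions
    zpool upgrade -v              # versions this build supports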

Thank you in advance,

-Chris

ch...@weston:~# zdb
rpool:
version: 22
name: 'rpool'
state: 0
txg: 7254
pool_guid: 17616386148370290153
hostid: 8413798
hostname: 'weston'
vdev_children: 1
vdev_tree:
type: 'root'
id: 0
guid: 17616386148370290153
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 14826633751084073618
path: '/dev/dsk/c5t0d0s0'
devid: 'id1,s...@sata_vbox_harddiskvbf6ff53d9-49330fdb/a'
phys_path: '/p...@0,0/pci8086,2...@d/d...@0,0:a'
whole_disk: 0
metaslab_array: 23
metaslab_shift: 28
ashift: 9
asize: 32172408832
is_log: 0
create_txg: 4
test:
version: 27
name: 'test'
state: 0
txg: 26
pool_guid: 13455895622924169480
hostid: 8413798
hostname: 'weston'
vdev_children: 1
vdev_tree:
type: 'root'
id: 0
guid: 13455895622924169480
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 7436238939623596891
path: '/dev/dsk/c5t1d0s0'
devid: 'id1,s...@sata_vbox_harddiskvba371da65-169e72ea/a'
phys_path: '/p...@0,0/pci8086,2...@d/d...@1,0:a'
whole_disk: 1
metaslab_array: 30
metaslab_shift: 24
ashift: 9
asize: 3207856128
is_log: 0
create_txg: 4
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-code] What append to ZFS bp rewrite?

2010-09-13 Thread Will Fiveash
On Fri, Sep 10, 2010 at 08:36:13AM -0700, Steeve Roy wrote:
 I am currently preparing a big SAN deployment using ZFS. As I will start with 
 60TB of data with a growth rate of 25% per year, I need some online defrag, 
 data redistribution across drives as the storage pool increases, etc...
 
 When can we expect to get the bp rewrite feature into ZFS?
 
 Thanks!

I'm thinking zfs-discuss@opensolaris.org is a better place to ask
(cc'ed).
-- 
Will Fiveash
Oracle
http://opensolaris.org/os/project/kerberos/
Sent using mutt, a sweet text based e-mail app: http://www.mutt.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] What append to ZFS bp rewrite?

2010-09-13 Thread Steeve Roy
I am currently preparing a big SAN deployment using ZFS. As I will start with 
60TB of data with a growth rate of 25% per year, I need some online defrag, 
data redistribution across drives as the storage pool increases, etc...



When can we expect to get the bp rewrite feature into ZFS?



Thanks!


Steeve Roy
IT Manager
Coveo
2800 St-Jean Baptiste
Suite 212
Québec, Qc G2E 6J5
Office: +1-418-263- ext:330  FAX: +1-418-263-1221
Mobile: +1-418-802-5440
s...@coveo.commailto:s...@coveo.com
www.coveo.comhttp://www.coveo.com
Information Access at the Speed of Business(tm)
---
This email and any attachments thereto may contain private, confidential, and 
privileged material for the sole use of the intended recipient named in the 
original email to which this message was attached. Any review, copying, or 
distribution of this email (or any attachments thereto) by others is strictly 
prohibited. If you are not the intended recipient, please return this email to 
the sender immediately and permanently delete the original and any copies of 
this email and any attachments thereto.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proper procedure when device names have changed

2010-09-13 Thread Robert Mustacchi
Or you can go into udev's persistent rules and set it up such that the 
drives always get the correct names. I'd guess you'll probably find them 
somewhere under /etc/udev/rules.d or something similar. It will likely 
save you trouble in the long run, as they likely are getting shuffled 
with either a kernel or udev upgrade.
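A hedged sketch of such a rule on Ubuntu 10.04 (the serial number is made up; udev
sets ID_SERIAL via its ata_id/scsi_id helpers).  In /etc/udev/rules.d/60-zfs-disks.rules:

    SUBSYSTEM=="block", KERNEL=="sd?", ENV{ID_SERIAL}=="SAMSUNG_HD103UJ_S13PJXXXXXXX", SYMLINK+="zfsdisk0"

then reload the rules:

    udevadm control --reload-rules && udevadm trigger

Alternatively, importing with zpool import -d /dev/disk/by-id sidesteps the shifting
sdX names entirely.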


Robert

On 9/13/10 10:31 AM, LaoTsao 老曹 wrote:


try export and import the zpool

On 9/13/2010 1:26 PM, Brian wrote:

I am running zfs-fuse on an Ubuntu 10.04 box. I have a dual mirrored
pool:
mirror sdd sde mirror sdf sdg

Recently the device names shifted on my box and the devices are now
sdc sdd sde and sdf. The pool is of course very unhappy because the
mirrors are no longer matched up and one device is missing. What is
the proper procedure to deal with this?

-brian



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intermittent ZFS hang

2010-09-13 Thread Charles J. Knipe
  At first we blamed de-dupe, but we've disabled that. Next we
 suspected
  the SSD log disks, but we've seen the problem with those removed, as
  well.
 
 Did you have dedup enabled and then disabled it? If so, data can (or
 will) be deduplicated on the drives. Currently the only way of de-
 deduping them is to recopy them after disabling dedup.

That's a good point.  There is deduplicated data still present on disk.  Do you 
think the issue we're seeing may be related to the existing deduped data?  I'm 
not against copying the contents of the pool over to a new pool, but 
considering the effort/disruption I'd want to make sure it's not just a shot in 
the dark.

If I don't have a good theory in another week, that's when I start shooting in 
the dark...

-Charles
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Configuration questions for Home File Server (CPU cores, dedup, checksum)?

2010-09-13 Thread David Dyer-Bennet

On Tue, September 7, 2010 15:58, Craig Stevenson wrote:

 3.  Should I consider using dedup if my server has only 8Gb of RAM?  Or,
 will that not be enough to hold the DDT?  In which case, should I add
 L2ARC / ZIL or am I better to just skip using dedup on a home file server?

I would not consider using dedup in the current state of the code.  I hear
too many horror stories.

Also, why do you think you'd get much benefit?  It takes pretty big blocks
of exact bit-for-bit duplication to actually trigger the code, and you're
not going to find them in compressed image (including motion picture /
video) or audio files, for example (the main things that take up much
space on most home servers).
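If you still want a rough idea of what dedup would cost before enabling it, recent
builds let you simulate it (pool name is a placeholder; each DDT entry needs very
roughly a few hundred bytes of RAM to stay cached):

    zdb -S tank        # simulated dedup table histogram and projected ratio; makes no changes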
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs compression with Oracle - anyone implemented?

2010-09-13 Thread Brad
Hi!  I've been scouring the forums and the web for admins/users who have deployed zfs 
with compression enabled on Oracle backed by storage array LUNs.
Any problems with cpu/memory overhead?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs compression with Oracle - anyone implemented?

2010-09-13 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Brad
 
 Hi!  I'd been scouring the forums and web for admins/users who deployed
 zfs with compression enabled on Oracle backed by storage array luns.
 Any problems with cpu/memory overhead?

I don't think your question is clear.  What do you mean on oracle backed by
storage luns?

Do you mean on oracle hardware?
Do you mean you plan to run oracle database on the server, with ZFS under
the database?

Generally speaking, you can enable compression on any zfs filesystem, and
the cpu overhead is not very big, and the compression level is not very
strong by default.  However, if the data you have is generally
uncompressible, any overhead is a waste.
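For reference, a hedged sketch of turning it on and checking what you get back
(dataset name is a placeholder; the default algorithm is lzjb, which is cheap on CPU):

    zfs set compression=on tank/oradata
    zfs get compression,compressratio tank/oradata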

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS online device management

2010-09-13 Thread Chris Mosetick
Can anyone elaborate on the zpool split command?  I have not seen any
examples in use and I am very curious about it.  Say I have 12 disks in a
pool named tank.  6 in a RAIDZ2 + another 6 in a RAIDZ2. All is well, and
I'm not even close to maximum capacity in the pool.  Say I want to swap out
6 of the 12 SATA disks for faster SAS disks, and make a new 6 disk pool with
just the SAS disks, leaving the existing pool with the SATA disks intact.

Can I run something like:

zpool split tank dozer c4t8d0 c4t9d0 c4t10d0 c4t11d0 c4t12d0 c4t13d0

zpool export dozer

Now, turn off the server, remove the 6 SATA disks.

Put in the 6 SAS disks.

Power on the server.

echo | format   to get the disk ID's of the new SAS disks.

zpool create speed raidz disk1 disk2 disk3 disk4 disk5 disk6

Thanks in advance,

-Chris


On Sat, Sep 11, 2010 at 4:37 PM, besson3c j...@netmusician.org wrote:

 Ahhh, I figured you could always do that, I guess I was wrong...
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS online device management

2010-09-13 Thread Richard Elling
On Sep 13, 2010, at 4:40 PM, Chris Mosetick wrote:

 Can anyone elaborate on the zpool split command?  I have not seen any 
 examples in use and I am very curious about it.  Say I have 12 disks in a pool 
 named tank.  6 in a RAIDZ2 + another 6 in a RAIDZ2. All is well, and I'm not 
 even close to maximum capacity in the pool.  Say I want to swap out 6 of the 
 12 SATA disks for faster SAS disks, and make a new 6 disk pool with just the 
 SAS disks, leaving the existing pool with the SATA disks intact.

zpool split only works on mirrors.

For examples, see the section Creating a New Pool By Splitting a Mirrored ZFS
Storage Pool in the ZFS Admin Guide.
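For a mirrored pool the operation itself is short; a hedged sketch with placeholder
pool names:

    zpool split tank tank2     # detaches one side of each mirror into the new pool tank2
    zpool import tank2         # the new pool is not imported automatically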
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com

Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS online device management

2010-09-13 Thread Chris Mosetick
So are there now any methods to achieve the scenario I described to shrink a
pool's size with existing ZFS tools?  I don't see a definitive way listed on
the old shrinking thread:
http://www.opensolaris.org/jive/thread.jspa?threadID=8125

Thank you,

-Chris


On Mon, Sep 13, 2010 at 4:55 PM, Richard Elling rich...@nexenta.com wrote:

 On Sep 13, 2010, at 4:40 PM, Chris Mosetick wrote:

  Can anyone elaborate on the zpool split command?  I have not seen any
 examples in use and I am very curious about it.  Say I have 12 disks in a
 pool named tank.  6 in a RAIDZ2 + another 6 in a RAIDZ2. All is well, and
 I'm not even close to maximum capacity in the pool.  Say I want to swap out
 6 of the 12 SATA disks for faster SAS disks, and make a new 6 disk pool with
 just the SAS disks, leaving the existing pool with the SATA disks intact.

 zpool split only works on mirrors.

 For examples, see the section Creating a New Pool By Splitting a Mirrored
 ZFS
 Storage Pool in the ZFS Admin Guide.
  -- richard

 --
 OpenStorage Summit, October 25-27, Palo Alto, CA
 http://nexenta-summit2010.eventbrite.com

 Richard Elling
 rich...@nexenta.com   +1-760-896-4422
 Enterprise class storage for everyone
 www.nexenta.com






___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS online device management

2010-09-13 Thread Richard Elling
On Sep 13, 2010, at 5:51 PM, Chris Mosetick wrote:

 So are there now any methods to achieve the scenario I described to shrink a 
 pools size with existing ZFS tools?  I don't see a definitive way listed on 
 the old shrinking thread.

Today, there is no way to accomplish what you want without copying.
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com

Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file recovery on lost RAIDZ array

2010-09-13 Thread Michael Eskowitz
I don't know what happened.  I was in the process of copying files onto my new 
file server when the copy process from the other machine failed.  I turned on 
the monitor for the fileserver and found that it had rebooted by itself at some 
point (machine fault maybe?) and when I remounted the drives every last thing 
was gone.

I am new to zfs.  How do you take snapshots?  Does the system do it 
automagically for you?
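Snapshots are manual unless you enable something like the zfs-auto-snapshot (Time
Slider) service; a hedged sketch with placeholder names:

    zfs snapshot tank/data@before-copy     # instant, takes almost no space initially
    zfs list -t snapshot                   # list existing snapshots
    zfs rollback tank/data@before-copy     # revert the filesystem to that snapshot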
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file recovery on lost RAIDZ array

2010-09-13 Thread Michael Eskowitz
Oh and yes, raidz1.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file recovery on lost RAIDZ array

2010-09-13 Thread Richard Elling
On Sep 12, 2010, at 7:49 PM, Michael Eskowitz wrote:

 I recently lost all of the data on my single parity raid z array.  Each of 
 the drives was encrypted with the zfs array built within the encrypted 
 volumes.
 
 I am not exactly sure what happened.  

Murphy strikes again!

 The files were there and accessible and then they were all gone.  The server 
 apparently crashed and rebooted and everything was lost.  After the crash I 
 remounted the encrypted drives and the zpool was still reporting that roughly 
 3TB of the 7TB array were used, but I could not see any of the files through 
 the array's mount point.  I unmounted the zpool and then remounted it and 
 suddenly zpool was reporting 0TB were used.  

Were you using zfs send/receive?  If so, then this is the behaviour expected 
when a
session is interrupted. Since the snapshot did not completely arrive at the 
receiver, the
changes are rolled back.  It can take a few minutes for terabytes to be freed.

 I did not remap the virtual device.  The only thing of note that I saw was 
 that the name of storage pool had changed.  Originally it was Movies and 
 then it became Movita.  I am guessing that the file system became corrupted 
 some how.  (zpool status did not report any errors)
 
 So, my questions are these... 
 
 Is there anyway to undelete data from a lost raidz array?

It depends entirely on the nature of the loss.  In the case I describe above, 
there is nothing
lost because nothing was there (!)

  If I build a new virtual device on top of the old one and the drive topology 
 remains the same, can we scan the drives for files from old arrays?

The short answer is no.

 Also, is there any way to repair a corrupted storage pool?

Yes, but it depends entirely on the nature of the corruption.

  Is it possible to back up the file table or whatever partition index zfs 
 maintains?

The ZFS configuration data is stored redundantly in the pool and checksummed.
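
If you want to poke at that yourself, something along these lines should work
(pool name and device path are placeholders):

# zdb -C poolname
# zdb -l /dev/dsk/c0t0d0s0

The first dumps the cached pool configuration; the second dumps the vdev
labels stored redundantly on a member device.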

 I imagine that you all are going to suggest that I scrub the array, but that 
 is not an option at this point.  I had a backup of all of the data lost as I 
 am moving between file servers so at a certain point I gave up and decided to 
 start fresh.  This doesn't give me a warm fuzzy feeling about zfs, though.

AFAICT, ZFS appears to be working as designed.  Are you trying to kill the 
canary? :-)
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com

Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Haudy Kazemi

Richard Elling wrote:
 On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote:
 From: Richard Elling [mailto:rich...@nexenta.com]

 This operational definition of fragmentation comes from the single-user,
 single-tasking world (PeeCees). In that world, only one thread writes files
 from one application at one time. In those cases, there is a reasonable
 expectation that a single file's blocks might be contiguous on a single disk.
 That isn't the world we live in, where we have RAID, multi-user, or
 multi-threaded environments.

 I don't know what you're saying, but I'm quite sure I disagree with it.

 Regardless of multithreading, multiprocessing, it's absolutely possible to
 have contiguous files, and/or file fragmentation.  That's not a
 characteristic which depends on the threading model.

 Possible, yes.  Probable, no.  Consider that a file system is allocating
 space for multiple, concurrent file writers.

With appropriate write caching and grouping or re-ordering of writes 
algorithms, it should be possible to minimize the amount of file 
interleaving and fragmentation on write that takes place.  (Or at least 
optimize the amount of file interleaving.  Years ago MFM hard drives had 
configurable sector interleave factors to better optimize performance 
when no interleaving meant the drive had spun the platter far enough to 
be ready to give the next sector to the CPU before the CPU was ready 
with the result that the platter had to be spun a second time around to 
wait for the CPU to catch up.)




 Also regardless of raid, it's possible to have contiguous or fragmented
 files.  The same concept applies to multiple disks.

 RAID works against the efforts to gain performance by contiguous access
 because the access becomes non-contiguous.

From what I've seen, defragmentation offers its greatest benefit when 
the tiniest reads are eliminated by grouping them into larger contiguous 
reads.  Once the contiguous areas reach a certain size (somewhere in the 
few Mbytes to a few hundred Mbytes range), further defragmentation 
offers little additional benefit.  Full defragmentation is a useful goal 
when the option of using file carving based data recovery is desirable.  
Also remember that defragmentation is not limited to space used by 
files.  It can also apply to free, unused space, which should also be 
defragmented to prevent future writes from being fragmented on write.
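
A rough back-of-the-envelope illustration of where that threshold comes from
(assuming made-up but typical numbers for a 7200 RPM disk: ~8 ms average seek
and ~100 MB/s sequential transfer):

  64 KB fragment: ~8 ms seek + ~0.6 ms transfer  -> ~93% of the time is seek
   8 MB extent:   ~8 ms seek + ~80 ms transfer   ->  ~9% of the time is seek
 100 MB extent:   ~8 ms seek + ~1000 ms transfer -> under 1% of the time is seek

Past a few tens of megabytes per contiguous extent, the seek cost disappears
into the transfer time, which is why further defragmentation stops paying off.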


With regard to multiuser systems and how that negates the need to 
defragment, I think that is only partially true.  As long as the files 
are defragmented enough so that each particular read request only 
requires one seek before it is time to service the next read request, 
further defragmentation may offer only marginal benefit.  On the other 
hand, if files have been fragmented to the point where each sector is stored 
separately on the drive, then each read request is going to take that 
much longer to complete (or will be interrupted by another read 
request because it has taken too long).


-hk
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Richard Elling
On Sep 13, 2010, at 9:41 PM, Haudy Kazemi wrote:
 Richard Elling wrote:
 On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote:
 From: Richard Elling [mailto:rich...@nexenta.com]

 This operational definition of fragmentation comes from the single-user,
 single-tasking world (PeeCees). In that world, only one thread writes files
 from one application at one time. In those cases, there is a reasonable
 expectation that a single file's blocks might be contiguous on a single disk.
 That isn't the world we live in, where we have RAID, multi-user, or
 multi-threaded environments.

 I don't know what you're saying, but I'm quite sure I disagree with it.

 Regardless of multithreading, multiprocessing, it's absolutely possible to
 have contiguous files, and/or file fragmentation.  That's not a
 characteristic which depends on the threading model.

 Possible, yes.  Probable, no.  Consider that a file system is allocating
 space for multiple, concurrent file writers.

 With appropriate write caching and grouping or re-ordering of writes 
 algorithms, it should be possible to minimize the amount of file interleaving 
 and fragmentation on write that takes place.

To some degree, ZFS already does this.  The dynamic block sizing tries to ensure
that a file is written into the largest block[1]
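
As a concrete illustration, the knob involved is the recordsize property
(dataset name below is hypothetical); files smaller than recordsize end up in
a single block sized to fit the file:

# zfs get recordsize tank/fs
# zfs set recordsize=16K tank/fs

The default is 128K; smaller values are typically only worthwhile for
database-style small random I/O.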

 (Or at least optimize the amount of file interleaving.  Years ago MFM hard 
 drives had configurable sector interleave factors to better optimize 
 performance when no interleaving meant the drive had spun the platter far 
 enough to be ready to give the next sector to the CPU before the CPU was 
 ready with the result that the platter had to be spun a second time around to 
 wait for the CPU to catch up.)

Reason #526 why SSDs kill HDDs on performance.

 Also regardless of raid, it's possible to have contiguous or fragmented
 files.  The same concept applies to multiple disks.
 
 
 
 RAID works against the efforts to gain performance by contiguous access
 because the access becomes non-contiguous.
 
 
 From what I've seen, defragmentation offers its greatest benefit when the 
 tiniest reads are eliminated by grouping them into larger contiguous reads.  
 Once the contiguous areas reach a certain size (somewhere in the few Mbytes 
 to a few hundred Mbytes range), further defragmentation offers little 
 additional benefit.

For the Wikipedia definition of defragmentation, this can only occur when the
files themselves are hundreds of megabytes in size.  This is not the general
case for which I see defragmentation used.

Also, ZFS has an intelligent prefetch algorithm that can hide some of the
performance impact of fragmentation on HDDs.

  Full defragmentation is a useful goal when the option of using file carving 
 based data recovery is desirable.  Also remember that defragmentation is not 
 limited to space used by files.  It can also apply to free, unused space, 
 which should also be defragmented to prevent future writes from being 
 fragmented on write.

This is why ZFS uses a first-fit allocation algorithm until free space becomes
too low, at which point it switches to a best-fit algorithm. As long as the
available space is big enough for the block, it will be used.
 
 With regard to multiuser systems and how that negates the need to defragment, 
 I think that is only partially true.  As long as the files are defragmented 
 enough so that each particular read request only requires one seek before it 
 is time to service the next read request, further defragmentation may offer 
 only marginal benefit.  On the other hand, if files have been fragmented to 
 the point where each sector is stored separately on the drive, then each read 
 request is going to take that much longer to complete (or will be interrupted 
 by another read request because it has taken too long).

Yes, so try to avoid running your ZFS pool at more than 96% full.
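
A quick way to keep an eye on that (pool name is a placeholder):

# zpool list tank
# zfs list -r -o space tank

The CAP column of zpool list shows percent used; on recent builds,
zfs list -o space breaks the usage down across the datasets in the pool.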
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com

Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss