Re: Summary: [zfs-discuss] Poor man's backup by attaching/detaching mirror drives on a _striped_ pool?
Constantin Gonzalez wrote:

> The supported alternative would be zfs snapshot, then zfs send/receive, but this introduces the complexity of snapshot management which makes it less simple, thus less appealing to the clone-addicted admin. ... IMHO, we should investigate if something like "zpool clone" would be useful. It could be implemented as a script that recursively snapshots the source pool, then zfs send/receives it to the destination pool, then copies all properties, but the actual reason why people do mirror splitting in the first place is because of its simplicity. A "zpool clone" or a "zpool send/receive" command would be even simpler and less error-prone than the tradition of splitting mirrors, plus it could be implemented more efficiently and more reliably than a script, thus bringing real additional value to administrators.

I agree that this is the best solution. I am working on zfs send -r (RFE filed but id not handy), which will provide the features you describe above.

--matt
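For illustration, a very rough sketch of the script-based "zpool clone" described above; the pool names (tank, backup), the snapshot name and the property are invented, and since a recursive send did not exist yet, each filesystem has to be sent individually:

  # snapshot the entire source pool
  zfs snapshot -r tank@clone1
  # send each filesystem into the destination pool
  zfs send tank@clone1      | zfs receive backup/tank
  zfs send tank/home@clone1 | zfs receive backup/tank/home
  # copy properties by hand, one per filesystem, e.g.
  zfs set compression=on backup/tank/home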
Re: [zfs-discuss] Who modified my ZFS receive destination?
Constantin Gonzalez wrote:

> But at some point, zfs receive says "cannot receive: destination has been modified since most recent snapshot". I am pretty sure nobody changed anything at my destination filesystem and I also tried rolling back to an earlier snapshot on the destination filesystem to make it clean again.

As Eric noted, you should use 'zfs recv -F' to do a rollback if necessary. Also, you could use dtrace to figure out when the modification occurred, and by whom. We are also working on 'zfs diffs' (RFE filed but id not handy), which would be able to tell you what was modified.

--matt
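A sketch of both suggestions; the dataset names and mountpoint (tank/fs, backup/fs, /backup/fs) are invented for illustration:

  # force the destination back to its most recent snapshot while receiving the increment
  zfs send -i tank/fs@snap1 tank/fs@snap2 | zfs receive -F backup/fs

  # watch for anything writing under the destination mountpoint between receives
  dtrace -n 'syscall::write:entry /fds[arg0].fi_mount == "/backup/fs"/ { @[execname, pid] = count(); }'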
Re: [zfs-discuss] Re: Benchmarking
Anton B. Rang wrote:

>> I time mkfile'ing a 1 gb file on ufs and copying it [...] then did the same thing on each zfs partition. Then I took snapshots, copied files, more snapshots, keeping timings all the way. [ ... ] Is this a sufficient, valid test?
>
> If your applications do that -- manipulate large files, primarily copying them -- then it may be. If your applications have other access patterns, probably not. If you're concerned about whether you should put ZFS into production, then you should put it onto your test system and run your real applications on it for a while to qualify it (just as you should for any other file system or hardware).

I couldn't agree more. That said, I would be extremely surprised if the presence of snapshots or clones had any impact whatsoever on the performance of accessing a given filesystem. I've never seen anything like that.

--matt
Re: [zfs-discuss] crash
Opensolaris Aserver wrote:

> We tried to replicate a snapshot via the built-in send receive zfs tools. ...
> ZFS: bad checksum (read on unknown off 0: zio 3017b300 [L0 ZFS plain file] 2L/2P DVA[0]=0:3b98ed1e800:25800 fletcher2 uncompressed LE contiguous birth=806063 fill=1 cksum=a487e32d ...
> errors: Permanent errors have been detected in the following files:
> stor/[EMAIL PROTECTED]:01:00:/1003/kreos11/HB1030/C_Root/Documents and Settings/bvp/My Documents/My Pictures/confidential/tconfidential/confidential/96 ...
> Soon we decided to destroy this snapshot, and then started another replication. This time the server crashed again :-(

So, some of your data has been lost due to hardware failure, where the hardware has silently corrupted your data. ZFS has detected this. If you were to read this data (other than via 'zfs send'), you would get EIO, and as you note, 'zpool status' shows which files are affected. The 'zfs send' protocol isn't able to tell the other side "this part of this file is corrupt", so it panics. This is a bug.

The reason you're seeing the panic when 'zfs send'-ing the next snapshot is that the (corrupt) data is shared between multiple snapshots. You can work around this by deleting or overwriting the affected files, then taking and sending a new snapshot.

--matt
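A minimal sketch of that workaround; the pool name (stor), snapshot name and remote host are placeholders, and the damaged paths are whatever 'zpool status -v' actually reports:

  # list the files with permanent errors
  zpool status -v stor
  # delete or overwrite each affected file, then snapshot and send the clean state
  rm "/stor/.../My Pictures/.../96"
  zfs snapshot -r stor@clean1
  zfs send stor@clean1 | ssh backuphost zfs receive backuppool/stor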
Re: [zfs-discuss] Permanently removing vdevs from a pool
Robert Milkowski wrote:

> Hello George,
> Friday, April 20, 2007, 7:37:52 AM, you wrote:
> GW> This is a high priority for us and is actively being worked.
> GW> Vague enough for you. :-) Sorry I can't give you anything more exact than that.
>
> Can you at least give us the feature list being developed? Some answers to questions like:
> 1. evacuating a vdev resulting in a smaller pool for all raid configs - ?
> 2. adding new vdev and rewriting all existing data to new larger stripe - ?
> 3. expanding stripe width for raid-z1 and raid-z2 - ?
> 4. live conversion between different raid kinds on the same disk set - ?

No, you will not be able to change the number of disks in a raid-z set (I think that answers questions 1-4). There is no plan to implement this feature.

> 5. live data migration from one disk set to another - ?

Yes -- just add the new disk set, then remove the old disk set.

> 6. rewriting data in a dataset (not entire pool) after changing some parameters like compression, encryption, ditto blocks, ... so it will affect also already written data in a dataset. This should be both pool wise and data set wise - ?

Yes.

> 7. de-fragmentation of a pool - ?

Yes.

> 8. anything else ?

--matt
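A hypothetical sketch of the migration described in the answer to question 5, with invented device names; note that at the time of this thread 'zpool remove' only handled hot spares, so the second step is speculative:

  # add the new disk set as an additional top-level vdev
  zpool add tank raidz c5t0d0 c5t1d0 c5t2d0
  # the missing piece is evacuating and removing the old vdev; once implemented
  # it might look roughly like:
  #   zpool remove tank raidz-0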
Re: [zfs-discuss] Permanently removing vdevs from a pool
Matty wrote:

> On 4/20/07, George Wilson wrote:
>> This is a high priority for us and is actively being worked. Vague enough for you. :-) Sorry I can't give you anything more exact than that.
>
> Hi George,
> If ZFS is supposed to be part of opensolaris, then why can't the community get additional details?

What additional details would you like? We are not withholding anything -- George answered the question to the best of his knowledge. We simply aren't sure when exactly this feature will be completed.

--matt
[zfs-discuss] Re: Re: Re: gzip compression throttles system?
A couple more questions here.

[mpstat]
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0 3109  3616  316  196    5   17   48   45   245    0  85   0  15
   1    0   0 3127  3797  592  217    4   17   63   46   176    0  84   0  15
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0 3051  3529  277  201    2   14   25   48   216    0  83   0  17
   1    0   0 3065  3739  606  195    2   14   37   47   153    0  82   0  17
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0 3011  3538  316  242    3   26   16   52   202    0  81   0  19
   1    0   0 3019  3698  578  269    4   25   23   56   309    0  83   0  17
 ...

The largest numbers from mpstat are for interrupts and cross calls. What does intrstat(1M) show? Have you run dtrace to determine the most frequent cross-callers?

As far as I understand it, we have these frequent cross calls because

1. the test was run on an x86 MP machine
2. the kernel zmod / gzip code allocates and frees four big chunks of memory (4 * 65544 bytes) per zio_write_compress (gzip) call [1]

Freeing these big memory chunks generates lots of cross calls, because page table entries for that memory are invalidated on all cpus (cores). Of course this effect cannot be observed on a uniprocessor machine (one cpu / core). And apparently it isn't the root cause for the bad interactive performance with this test; the bad interactive performance can also be observed on single-cpu / single-core x86 machines.

A possible optimization for MP machines: use some kind of kmem_cache for the gzip buffers, so that these buffers could be reused between gzip compression calls.

[1] allocations per zio_write_compress() / gzip_compress() call:

  1   6642  kobj_alloc:entry  sz 5936,  fl 1001
  1   6642  kobj_alloc:entry  sz 65544, fl 1001
  1   6642  kobj_alloc:entry  sz 65544, fl 1001
  1   6642  kobj_alloc:entry  sz 65544, fl 1001
  1   6642  kobj_alloc:entry  sz 65544, fl 1001
  1   5769  kobj_free:entry   fffeeb307000: sz 65544
  1   5769  kobj_free:entry   fffeeb2f5000: sz 65544
  1   5769  kobj_free:entry   fffeeb2e3000: sz 65544
  1   5769  kobj_free:entry   fffeeb2d1000: sz 65544
  1   5769  kobj_free:entry   fffed1c42000: sz 5936
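The "most frequent cross-callers" question above can be answered with a stock DTrace one-liner; this is a generic sketch rather than anything from the original thread:

  # count cross-calls by kernel stack (run for a few seconds, then Ctrl-C)
  dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); }'

  # or simply by the process on whose behalf they occur
  dtrace -n 'sysinfo:::xcalls { @[execname] = count(); }'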
[zfs-discuss] Re: Re: Re: gzip compression throttles system?
with recent bits ZFS compression is now handled concurrently, with many CPUs working on different records. So this load will burn more CPUs and achieve its results (compression) faster. So the observed pauses should be consistent with that of a load generating high system time. The assumption is that compression now goes faster than when it was single-threaded. Is this undesirable? We might seek a way to slow down compression in order to limit the system load.

According to this dtrace script

#!/usr/sbin/dtrace -s

sdt:genunix::taskq-enqueue
/((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)&`zio_write_compress/
{
        @where[stack()] = count();
}

tick-5s
{
        printa(@where);
        trunc(@where);
}

... I see bursts of ~ 1000 zio_write_compress() [gzip] taskq calls enqueued into the spa_zio_issue taskq by zfs`spa_sync() and its children:

  0  76337  :tick-5s
  ...
              zfs`zio_next_stage+0xa1
              zfs`zio_wait_for_children+0x5d
              zfs`zio_wait_children_ready+0x20
              zfs`zio_next_stage_async+0xbb
              zfs`zio_nowait+0x11
              zfs`dbuf_sync_leaf+0x1b3
              zfs`dbuf_sync_list+0x51
              zfs`dbuf_sync_indirect+0xcd
              zfs`dbuf_sync_list+0x5e
              zfs`dbuf_sync_indirect+0xcd
              zfs`dbuf_sync_list+0x5e
              zfs`dnode_sync+0x214
              zfs`dmu_objset_sync_dnodes+0x55
              zfs`dmu_objset_sync+0x13d
              zfs`dsl_dataset_sync+0x42
              zfs`dsl_pool_sync+0xb5
              zfs`spa_sync+0x1c5
              zfs`txg_sync_thread+0x19a
              unix`thread_start+0x8
             1092

  0  76337  :tick-5s

It seems that after such a batch of compress requests is submitted to the spa_zio_issue taskq, the kernel is busy for several seconds working on these taskq entries. It seems that this blocks all other taskq activity inside the kernel...

This dtrace script counts the number of zio_write_compress() calls enqueued / execed by the kernel per second:

#!/usr/sbin/dtrace -qs

sdt:genunix::taskq-enqueue
/((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)&`zio_write_compress/
{
        this->tqe = (taskq_ent_t *)arg1;
        @enq[this->tqe->tqent_func] = count();
}

sdt:genunix::taskq-exec-end
/((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)&`zio_write_compress/
{
        this->tqe = (taskq_ent_t *)arg1;
        @exec[this->tqe->tqent_func] = count();
}

tick-1s
{
        /* printf("%Y\n", walltimestamp); */
        printf("TS(sec): %u\n", timestamp / 1000000000);
        printa("enqueue %a: %@d\n", @enq);
        printa("exec    %a: %@d\n", @exec);
        trunc(@enq);
        trunc(@exec);
}

I see bursts of zio_write_compress() calls enqueued / execed, and periods of time where no zio_write_compress() taskq calls are enqueued or execed.
10# ~jk/src/dtrace/zpool_gzip7.d
TS(sec): 7829
TS(sec): 7830
TS(sec): 7831
TS(sec): 7832
TS(sec): 7833
TS(sec): 7834
TS(sec): 7835
enqueue zfs`zio_write_compress: 1330
exec    zfs`zio_write_compress: 1330
TS(sec): 7836
TS(sec): 7837
TS(sec): 7838
TS(sec): 7839
TS(sec): 7840
TS(sec): 7841
TS(sec): 7842
TS(sec): 7843
TS(sec): 7844
enqueue zfs`zio_write_compress: 1116
exec    zfs`zio_write_compress: 1116
TS(sec): 7845
TS(sec): 7846
TS(sec): 7847
TS(sec): 7848
TS(sec): 7849
TS(sec): 7850
TS(sec): 7851
TS(sec): 7852
TS(sec): 7853
TS(sec): 7854
TS(sec): 7855
TS(sec): 7856
TS(sec): 7857
enqueue zfs`zio_write_compress: 932
exec    zfs`zio_write_compress: 932
TS(sec): 7858
TS(sec): 7859
TS(sec): 7860
TS(sec): 7861
TS(sec): 7862
TS(sec): 7863
TS(sec): 7864
TS(sec): 7865
TS(sec): 7866
TS(sec): 7867
enqueue zfs`zio_write_compress: 5
exec    zfs`zio_write_compress: 5
TS(sec): 7868
enqueue zfs`zio_write_compress: 774
exec    zfs`zio_write_compress: 774
TS(sec): 7869
TS(sec): 7870
TS(sec): 7871
TS(sec): 7872
TS(sec): 7873
TS(sec): 7874
TS(sec): 7875
TS(sec): 7876
enqueue zfs`zio_write_compress: 653
exec    zfs`zio_write_compress: 653
TS(sec): 7877
TS(sec): 7878
TS(sec): 7879
TS(sec): 7880
TS(sec): 7881

And a final dtrace script, which monitors scheduler activity while filling a gzip compressed pool:

#!/usr/sbin/dtrace -qs

sched:::off-cpu,
sched:::on-cpu,
sched:::remain-cpu,
sched:::preempt
{
        /* @[probename, stack()] = count(); */
        @[probename] = count();
}

tick-1s
{
        printf("%Y", walltimestamp);
        printa(@);
        trunc(@);
}

It shows periods of time with absolutely *no* scheduling activity (I guess this is when the spa_zio_issue taskq is working on such a big batch of submitted gzip compression calls):

21# ~jk/src/dtrace/zpool_gzip9.d
2007 May  6 21:38:12
  preempt          13
  off-cpu         808
  on-cpu          808
2007
[zfs-discuss] zdb -l goes wild about the labels
running a recent patched s10 system, zfs version 3, attempting to dump the label information using zdb when the pool is online doesn't seem to give a reasonable information, any particular reason for this ?

# zpool status
  pool: blade-mirror-pool
 state: ONLINE
 scrub: none requested
config:

        NAME                 STATE     READ WRITE CKSUM
        blade-mirror-pool    ONLINE       0     0     0
          mirror             ONLINE       0     0     0
            c2t12d0          ONLINE       0     0     0
            c2t13d0          ONLINE       0     0     0

errors: No known data errors

  pool: blade-single-pool
 state: ONLINE
 scrub: none requested
config:

        NAME                 STATE     READ WRITE CKSUM
        blade-single-pool    ONLINE       0     0     0
          c2t14d0            ONLINE       0     0     0

errors: No known data errors

# zdb -l /dev/dsk/c2t12d0
LABEL 0
LABEL 1
failed to unpack label 1
LABEL 2
LABEL 3

# zdb -l /dev/rdsk/c2t12d0
LABEL 0
LABEL 1
failed to unpack label 1
LABEL 2
LABEL 3

# zdb -l /dev/dsk/c2t14d0
LABEL 0
LABEL 1
failed to unpack label 1
LABEL 2
failed to unpack label 2
LABEL 3
failed to unpack label 3
#
Re: [zfs-discuss] zdb -l goes wild about the labels
On May 7, 2007, at 7:11 AM, Frank Batschulat wrote:

> running a recent patched s10 system, zfs version 3, attempting to dump the label information using zdb when the pool is online doesn't seem to give a reasonable information, any particular reason for this ?
>
> [zpool status output for blade-mirror-pool and blade-single-pool quoted in full in the previous message]
>
> # zdb -l /dev/dsk/c2t12d0

Try giving it:

# zdb -l /dev/dsk/c2t12d0s0

eric

> LABEL 0
> LABEL 1
> failed to unpack label 1
> LABEL 2
> LABEL 3
>
> # zdb -l /dev/rdsk/c2t12d0
> LABEL 0
> LABEL 1
> failed to unpack label 1
> LABEL 2
> LABEL 3
>
> # zdb -l /dev/dsk/c2t14d0
> LABEL 0
> LABEL 1
> failed to unpack label 1
> LABEL 2
> failed to unpack label 2
> LABEL 3
> failed to unpack label 3
> #
[zfs-discuss] Zpool, RaidZ how it spreads its disk load?
Greetings learned ZFS geeks and gurus,

Yet another question comes from my continued ZFS performance testing. This has to do with zpool iostat, and the strangeness that I do see.

I've created an eight (8) disk raidz pool from a Sun 3510 fibre array giving me a 465G volume.

# zpool create tp raidz c4t600 ... (8 disks worth of zpool)
# zfs create tp/pool
# zfs set recordsize=8k tp/pool
# zfs set mountpoint=/pool tp/pool

I then create a 100G data file that is created by sequentially writing 64k blocks to the test data file. When I then issue a

# zpool iostat -v tp 10

I see the following strange behaviour. I see anywhere from up to 16 iterations (ie 160 seconds) of the following, where there are only writes to 2 of the 8 disks:

                                      capacity     operations    bandwidth
pool                                used  avail   read  write   read  write
----------------------------------  -----  -----  -----  -----  -----  -----
testpool                            29.7G   514G      0  2.76K      0  22.1M
  raidz1                            29.7G   514G      0  2.76K      0  22.1M
    c4t600C0FF00A74531B659C5C00d0s6     -      -      0      0      0      0
    c4t600C0FF00A74533F3CF1AD00d0s6     -      -      0      0      0      0
    c4t600C0FF00A74534C5560FB00d0s6     -      -      0      0      0      0
    c4t600C0FF00A74535E50E5A400d0s6     -      -      0  1.38K      0  2.76M
    c4t600C0FF00A74537C1C061500d0s6     -      -      0      0      0      0
    c4t600C0FF00A745343B08C4B00d0s6     -      -      0      0      0      0
    c4t600C0FF00A745379CB90B600d0s6     -      -      0      0      0      0
    c4t600C0FF00A74530237AA9300d0s6     -      -      0  1.38K      0  2.76M
----------------------------------  -----  -----  -----  -----  -----  -----

During these periods, my data file does not grow in size, but then I see writes to all of the disks like the following:

                                      capacity     operations    bandwidth
pool                                used  avail   read  write   read  write
----------------------------------  -----  -----  -----  -----  -----  -----
testpool                            64.0G   480G      0  1.45K      0  11.6M
  raidz1                            64.0G   480G      0  1.45K      0  11.6M
    c4t600C0FF00A74531B659C5C00d0s6     -      -      0    246      0  8.22M
    c4t600C0FF00A74533F3CF1AD00d0s6     -      -      0    220      0  8.23M
    c4t600C0FF00A74534C5560FB00d0s6     -      -      0    254      0  8.20M
    c4t600C0FF00A74535E50E5A400d0s6     -      -      0    740      0  1.45M
    c4t600C0FF00A74537C1C061500d0s6     -      -      0    299      0  8.21M
    c4t600C0FF00A745343B08C4B00d0s6     -      -      0    284      0  8.21M
    c4t600C0FF00A745379CB90B600d0s6     -      -      0    266      0  8.22M
    c4t600C0FF00A74530237AA9300d0s6     -      -      0    740      0  1.45M
----------------------------------  -----  -----  -----  -----  -----  -----

And my data file will increase in size, but also notice, in the above, those disks that were being written to before have a load that is consistent with the previous example.

For background, the server and the storage are dedicated solely to this testing, and there are no other applications being run at this time. I thought that RaidZ would spread its load across all disks somewhat evenly. Can someone explain this result? I can consistently reproduce it as well.

Thanks
-Tony
[zfs-discuss] Re: Zpool, RaidZ how it spreads its disk load?
Something I was wondering about myself. What does the raidz toplevel (pseudo?) device do? Does it just indicate to the SPA, or whatever module is responsible, to additionally generate parity? The thing I'd like to know is if variable block sizes, dynamic striping et al still applies to a single RAIDZ device, too. Thanks! -mg
Re: [zfs-discuss] Motley group of discs?
Hi Lee,

You can decide whether you want to use ZFS for a root file system now. You can find this info here: http://opensolaris.org/os/community/zfs/boot/

Consider this setup for your other disks, which are: 250, 200 and 160 GB drives, and an external USB 2.0 600 GB drive

250GB = disk1
200GB = disk2
160GB = disk3
600GB = disk4 (spare)

I include a spare in this setup because you want to be protected from a disk failure. Since the replacement disk must be equal to or larger than the disk to replace, I think this is the best (safest) solution.

zpool create pool raidz disk1 disk2 disk3 spare disk4

This setup provides less capacity but better safety, which is probably important for older disks. Because of the spare disk requirement (must be equal to or larger in size), I don't see a better arrangement. I hope someone else can provide one.

Your questions remind me that I need to provide add'l information about the current ZFS spare feature...

Thanks,
Cindy

Lee Fyock wrote:

I didn't mean to kick up a fuss. I'm reasonably zfs-savvy in that I've been reading about it for a year or more. I'm a Mac developer and general geek; I'm excited about zfs because it's new and cool. At some point I'll replace my old desktop machine with something new and better -- probably when Unreal Tournament 2007 arrives, necessitating a faster processor and better graphics card. :-) In the mean time, I'd like to hang out with the system and drives I have. As mike said, my understanding is that zfs would provide error correction until a disc fails, if the setup is properly done. That's the setup for which I'm requesting a recommendation. I won't even be able to use zfs until Leopard arrives in October, but I want to bone up so I'll be ready when it does. Money isn't an issue here, but neither is creating an optimal zfs system. I'm curious what the right zfs configuration is for the system I have. Thanks! Lee

On May 4, 2007, at 7:41 PM, Al Hopper wrote:

On Fri, 4 May 2007, mike wrote: Isn't the benefit of ZFS that it will allow you to use even the most unreliable disks and be able to inform you when they are attempting to corrupt your data?

Yes - I won't argue that ZFS can be applied exactly as you state above. However, ZFS is no substitute for bad practices that include:
- not proactively replacing mechanical components *before* they fail
- not having maintenance policies in place

To me it sounds like he is a SOHO user; may not have a lot of funds to go out and swap hardware on a whim like a company might.

You may be right - but you're simply guessing. The original system probably cost around $3k (?? I could be wrong). So what I'm suggesting, that he spend ~ $300, represents ~ 10% of the original system cost. Since the OP asked for advice, I've given him the best advice I can come up with. I've also encountered many users who don't keep up to date with current computer hardware capabilities and pricing, and who may be completely unaware that you can purchase two 500Gb disk drives, with a 5 year warranty, for around $300. And possibly less if you check out Frys weekly bargain disk drive offers. Now consider the total cost of ownership solution I recommended: 500 gigabytes of storage, coupled with ZFS, which translates into $60/year for 5 years of error free storage capability. Can life get any better than this! :) Now contrast my recommendation with what you propose - re-targeting a bunch of older disk drives, which incorporate older, less reliable technology, with a view to saving money. How much is your time worth?
How many hours will it take you to recover from a failure of one of these older drives and the accompanying increased risk of data loss? If the ZFS-savvy OP comes back to this list and says "Al's solution is too expensive" I'm perfectly willing to rethink my recommendation. For now, I believe it to be the best recommendation I can devise.

ZFS in my opinion is well-suited for those without access to continuously upgraded hardware and expensive fault-tolerant hardware-based solutions. It is ideal for home installations where people think their data is safe until the disk completely dies. I don't know how many non-savvy people I have helped over the years who have no data protection, and ZFS could offer them at least some fault-tolerance and protection against corruption, and could help notify them when it is time to shut off their computer and call someone to come swap out their disk and move their data to a fresh drive before it's completely failed...

Agreed. One piece-of-the-puzzle that's missing right now, IMHO, is a reliable, two port, low-cost PCI SATA disk controller. A solid/de-bugged 3124 driver would go a long way to ZFS-enabling a bunch of cost-constrained ZFS users. And, while I'm working this hardware wish list, please ... a PCI-Express based version of the SuperMicro AOC-SAT2-MV8 8-port Marvell based disk controller
Re: [zfs-discuss] Zpool, RaidZ how it spreads its disk load?
On 5/7/07, Tony Galway wrote:

> Greetings learned ZFS geeks and gurus,
> Yet another question comes from my continued ZFS performance testing. This has to do with zpool iostat, and the strangeness that I do see. I've created an eight (8) disk raidz pool from a Sun 3510 fibre array giving me a 465G volume.
>
> # zpool create tp raidz c4t600 ... (8 disks worth of zpool)
> # zfs create tp/pool
> # zfs set recordsize=8k tp/pool
> # zfs set mountpoint=/pool tp/pool

This is a known problem, and is an interaction between the alignment requirements imposed by RAID-Z and the small recordsize you have chosen. You may effectively avoid it in most situations by choosing a RAID-Z stripe width of 2^n+1. For a fixed record size, this will work perfectly well.

Even so, there will still be cases where small files will cause problems for RAID-Z. While it does not affect many people right now, I think it will become a more serious issue when disks move to 4k sectors.

I think the reason for the alignment constraint was to ensure that the stranded space was accounted for, otherwise it would cause problems as the pool fills up. (Consider a 3 device RAID-Z, where only one data sector and one parity sector are written; the third sector in that stripe is essentially dead space.)

Would it be possible (or worthwhile) to make the allocator aware of this dead space, rather than imposing the alignment requirements? Something like a concept of "tentatively allocated" space in the allocator, which would be managed based on the requirements of the vdev. Using such a mechanism, it could coalesce the space if possible for allocations. Of course, it would also have to convert the misaligned bits back into tentatively allocated space when blocks are freed.

While I expect this may require changes which would not easily be backward compatible, the alignment on RAID-Z has always felt a bit wrong. While the more severe effects can be addressed by also writing out the dead space, that will not address uneven placement of data and parity across the stripes.

Any thoughts?

Chris
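The "2^n+1" suggestion above, written out for the same 8K recordsize (device names abbreviated, and this is only an illustration of the width, not a tuning recommendation): with 4 data disks plus 1 parity disk, each 8K record splits evenly into four 2K columns, so no roundup padding is needed.

  # 5-wide raidz = 2^2 data disks + 1 parity disk
  zpool create tp raidz c4t...1d0 c4t...2d0 c4t...3d0 c4t...4d0 c4t...5d0
  zfs create tp/pool
  zfs set recordsize=8k tp/pool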
[zfs-discuss] Re: Motley group of discs?
Given the odd sizes of your drives, there might not be one, unless you are willing to sacrifice capacity.

I think for the SOHO and home user scenarios, it might be of advantage if the disk drivers offered unified APIs to read out and interpret disk drive diagnostics, like SMART on ATA and whatever there is for SCSI/SAS, so that ZFS can react on it. Be it automatically invoking spare discs or showing warnings in the pool status. Or even automatically evacuating the device (given that ZFS will support it at some point) depending on the severity, should there be enough space on the other disks. For instance going top to bottom through the filesystems by importance, which would however require an "importance" attribute.

-mg
[zfs-discuss] Re: Zpool, RaidZ how it spreads its disk load?
What are these alignment requirements? I would have thought that at the lowest level, parity stripes would have been allocated traditionally, while treating the remaining usable space like a JBOD the level above, thus not being subject to any constraints (apart from when getting close to the parity stripe boundaries).

-mg
Re: [zfs-discuss] Motley group of discs?
Cindy,

Thanks so much for the response -- this is the first one that I consider an actual answer. :-)

I'm still unclear on exactly what I end up with. I apologize in advance for my ignorance -- the ZFS admin guide assumes knowledge that I don't yet have.

I assume that disk4 is a hot spare, so if one of the other disks dies, it'll kick into active use. Is data immediately replicated from the other surviving disks to disk4?

What usable capacity do I end up with? 160 GB (the smallest disk) * 3? Or less, because raidz has parity overhead? Or more, because that overhead can be stored on the larger disks?

If I didn't need a hot spare, but instead could live with running out and buying a new drive to add on as soon as one fails, what configuration would I use then?

Thanks!
Lee

On May 7, 2007, at 2:44 PM, [EMAIL PROTECTED] wrote:

> Hi Lee,
> You can decide whether you want to use ZFS for a root file system now. You can find this info here: http://opensolaris.org/os/community/zfs/boot/
> Consider this setup for your other disks, which are: 250, 200 and 160 GB drives, and an external USB 2.0 600 GB drive
> 250GB = disk1
> 200GB = disk2
> 160GB = disk3
> 600GB = disk4 (spare)
> I include a spare in this setup because you want to be protected from a disk failure. Since the replacement disk must be equal to or larger than the disk to replace, I think this is the best (safest) solution.
> zpool create pool raidz disk1 disk2 disk3 spare disk4
> This setup provides less capacity but better safety, which is probably important for older disks. Because of the spare disk requirement (must be equal to or larger in size), I don't see a better arrangement. I hope someone else can provide one.
> Your questions remind me that I need to provide add'l information about the current ZFS spare feature...
> Thanks, Cindy
Re: [zfs-discuss] Motley group of discs?
On 7-May-07, at 3:44 PM, [EMAIL PROTECTED] wrote:

> Hi Lee,
> You can decide whether you want to use ZFS for a root file system now. You can find this info here: http://opensolaris.org/os/community/zfs/boot/

Bearing in mind that his machine is a G4 PowerPC. When Solaris 10 is ported to this platform, please let me know, too.

--Toby
Re: [zfs-discuss] Motley group of discs?
Toby Thain wrote:

> On 7-May-07, at 3:44 PM, [EMAIL PROTECTED] wrote:
>> Hi Lee,
>> You can decide whether you want to use ZFS for a root file system now. You can find this info here: http://opensolaris.org/os/community/zfs/boot/
> Bearing in mind that his machine is a G4 PowerPC. When Solaris 10 is ported to this platform, please let me know, too.

For Solaris on PowerPC, it's probably easiest to just monitor this project: http://www.opensolaris.org/os/community/power_pc/

-Luke
Re: [zfs-discuss] Zpool, RaidZ how it spreads its disk load?
On 5/7/07, Chris Csanady wrote:

> On 5/7/07, Tony Galway wrote:
>> [8-disk raidz test setup with recordsize=8k, quoted in full in the previous messages]
>
> This is a known problem, and is an interaction between the alignment requirements imposed by RAID-Z and the small recordsize you have chosen. You may effectively avoid it in most situations by choosing a RAID-Z stripe width of 2^n+1. For a fixed record size, this will work perfectly well.

Well, an alignment issue may be the case for the second iostat output, but not for the first. I'd suspect in the first case the I/O being seen is the syncing of the transaction group and associated block pointers to the RAID (though I could be very wrong on this).

Also, I'm not entirely sure about your formula (how can you choose a stripe width that's not a power of 2?). For an 8 disk single-parity RAID, data is going to be written to 7 disks and parity to 1. If each disk block is 512 bytes, then 128 disk blocks will be written for each 64k filesystem block. This will require 18 rows (and a bit of the 19th) on the 7 data disks. Therefore we have a requirement for 128 blocks of data + 19 blocks of parity = 147 blocks. Now if we take into account the alignment requirement, it says that the number of blocks written must equal a multiple of (nparity + 1). So 148 blocks will be written. 148 % 8 = 4. This means that on each successive 64k write the 'extra' roundup block will alternate between one disk and another 4 disks apart (which happens to be just what we see).

> Even so, there will still be cases where small files will cause problems for RAID-Z. While it does not affect many people right now, I think it will become a more serious issue when disks move to 4k sectors.

True. But when disks move to 4k sectors they will be on the order of terabytes in size. It would probably be more pain than it's worth to try to efficiently pack these. (And it's very likely that your filesystem and per-file block size will be at least 4k.)

> I think the reason for the alignment constraint was to ensure that the stranded space was accounted for, otherwise it would cause problems as the pool fills up. (Consider a 3 device RAID-Z, where only one data sector and one parity sector are written; the third sector in that stripe is essentially dead space.)

Indeed. As Adam explained here:

http://www.opensolaris.org/jive/thread.jspa?threadID=26115&tstart=0

it specifically pertains to what happens if you allow an odd number of disk blocks to be written: you then free that block and try to fill the space with 512-byte fs blocks -- you get a single 512-byte hole that you can't fill.

> Would it be possible (or worthwhile) to make the allocator aware of this dead space, rather than imposing the alignment requirements? Something like a concept of "tentatively allocated" space in the allocator, which would be managed based on the requirements of the vdev. Using such a mechanism, it could coalesce the space if possible for allocations. Of course, it would also have to convert the misaligned bits back into tentatively allocated space when blocks are freed.

It would add complexity, and this roundup only occurs in the RAID-Z vdev.
As the metaslab/space allocator doesn't have any idea about the on-disk layout, it wouldn't be able to say whether successive single free blocks in the space map are on the same/different disks -- and this would further add to the complexity of data/parity allocation within the RAID-Z vdev itself.

> While I expect this may require changes which would not easily be backward compatible, the alignment on RAID-Z has always felt a bit wrong. While the more severe effects can be addressed by also writing out the dead space, that will not address uneven placement of data and parity across the stripes.

I've also had issues with this (under a slightly different guise). I've implemented a rather naive raidz implementation based on the current implementation which allows you to use all the disk space on an array of mismatched disks. What I've done is use the grid portion of the block pointer to specify a RAID 'version' number (of which you are currently allowed 255 (0 being reserved for the current layout)). I've then organized it such that metaslab_init is specialised in the raidz vdev (a la vdev_raidz_asize()) and allocates the metaslab as before, but forces a new metaslab when a boundary is reached that would alter the number of disks in a stripe. This increases the number of metaslabs by O(number of disks). It also means that you need to psize_to_asize slightly later in the metaslab allocation
Re: [zfs-discuss] Motley group of discs?
Lee,

Yes, the hot spare (disk4) should kick in if another disk in the pool fails, and yes, the data is moved to disk4.

You are correct: 160 GB (the smallest disk) * 3, + raidz parity info.

Here's the size of a raidz pool comprised of 3 136-GB disks:

# zpool list
NAME    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
pool    408G     98K    408G     0%  ONLINE     -
# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
pool  89.9K   267G  32.6K  /pool

The pool is 408GB in size but usable space in the pool is 267GB.

If you added the 600GB disk to the pool, then you'll still lose out on the extra capacity because of the smaller disks, which is why I suggested using it as a spare.

Regarding this:

> If I didn't need a hot spare, but instead could live with running out and buying a new drive to add on as soon as one fails, what configuration would I use then?

I don't have any add'l ideas but I still recommend going with a spare.

Cindy

Lee Fyock wrote:

> Cindy,
> Thanks so much for the response -- this is the first one that I consider an actual answer. :-)
> I'm still unclear on exactly what I end up with. I apologize in advance for my ignorance -- the ZFS admin guide assumes knowledge that I don't yet have.
> I assume that disk4 is a hot spare, so if one of the other disks dies, it'll kick into active use. Is data immediately replicated from the other surviving disks to disk4?
> What usable capacity do I end up with? 160 GB (the smallest disk) * 3? Or less, because raidz has parity overhead? Or more, because that overhead can be stored on the larger disks?
> If I didn't need a hot spare, but instead could live with running out and buying a new drive to add on as soon as one fails, what configuration would I use then?
> Thanks!
> Lee
> On May 7, 2007, at 2:44 PM, [EMAIL PROTECTED] wrote:
>> [Cindy's raidz + spare suggestion, quoted in full in the previous messages]
Re: [zfs-discuss] Motley group of discs? (doing it right, or right now)
I think it will be in the next.next (10.6) OSX, we just need to get apple to stop playing with their silly cell phone (that I cant help but want, damn them!). I have similar situation at home, but what I do is use Solaris 10 on a cheapish x86 box with 6 400gb IDE/SATA disks, I then make them into ISCSI targets and use that free GlobalSAN initiator ([EMAIL PROTECTED]). I once was like you, had 5 USB/Firewire drives hanging off everything and eventually I just got fed up with the mess of cables and wall warts. Perhaps my method of putting redundant and fast storage isn't as easy to achieve to everyone else. If you want more details about my setup, just email me directly, I don't mind :) -Andy On 5/7/07 4:48 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Lee, Yes, the hot spare (disk4) should kick if another disk in the pool fails and yes, the data is moved to disk4. You are correct: 160 GB (the smallest disk) * 3 + raidz parity info Here's the size of raidz pool comprised of 3 136-GB disks: # zpool list NAMESIZEUSED AVAILCAP HEALTH ALTROOT pool408G 98K408G 0% ONLINE - # zfs list NAME USED AVAIL REFER MOUNTPOINT pool 89.9K 267G 32.6K /pool The pool is 408GB in size but usable space in the pool is 267GB. If you added the 600GB disk to the pool, then you'll still lose out on the extra capacity because of the smaller disks, which is why I suggested using it as a spare. Regarding this: If I didn't need a hot spare, but instead could live with running out and buying a new drive to add on as soon as one fails, what configuration would I use then? I don't have any add'l ideas but I still recommend going with a spare. Cindy Lee Fyock wrote: Cindy, Thanks so much for the response -- this is the first one that I consider an actual answer. :-) I'm still unclear on exactly what I end up with. I apologize in advance for my ignorance -- the ZFS admin guide assumes knowledge that I don't yet have. I assume that disk4 is a hot spare, so if one of the other disks die, it'll kick into active use. Is data immediately replicated from the other surviving disks to disk4? What usable capacity do I end up with? 160 GB (the smallest disk) * 3? Or less, because raidz has parity overhead? Or more, because that overhead can be stored on the larger disks? If I didn't need a hot spare, but instead could live with running out and buying a new drive to add on as soon as one fails, what configuration would I use then? Thanks! Lee On May 7, 2007, at 2:44 PM, [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: Hi Lee, You can decide whether you want to use ZFS for a root file system now. You can find this info here: http://opensolaris.org/os/community/zfs/boot/ Consider this setup for your other disks, which are: 250, 200 and 160 GB drives, and an external USB 2.0 600 GB drive 250GB = disk1 200GB = disk2 160GB = disk3 600GB = disk4 (spare) I include a spare in this setup because you want to be protected from a disk failure. Since the replacement disk must be equal to or larger than the disk to replace, I think this is best (safest) solution. zpool create pool raidz disk1 disk2 disk3 spare disk4 This setup provides less capacity but better safety, which is probably important for older disks. Because of the spare disk requirement (must be equal to or larger in size), I don't see a better arrangement. I hope someone else can provide one. Your questions remind me that I need to provide add'l information about the current ZFS spare feature... 
Thanks, Cindy
[zfs-discuss] ZFS raid on removable media for backups/temporary use possible?
I've been using long SATA cables routed out through the case to a home built chassis with its own power supply for a year now. Not even eSATA. That part works well. Substitute this for USB/Firewire/SCSI/USB thumb drives. It's really the same problem. Ok, now you want to deal with a ZFS zpool raid on multiple(?) removable drives. How well does ZFS work on removable media? In a RAID configuration? Are there issues with matching device names to disks?
Re: [zfs-discuss] ZFS raid on removable media for backups/temporary use possible?
Tom Buskey wrote:

> How well does ZFS work on removable media? In a RAID configuration? Are there issues with matching device names to disks?

I've had a zpool with 4 250Gb IDE drives in three places recently:

- in an external 4-bay Firewire case, attached to a Sparc box
- inside a dual-Opteron white box, connected to a 2-channel add-in IDE controller
- inside the dual-Opteron, connected via 4 IDE-to-SATA convertors to the motherboard's built-in SATA controller

In each case, once 'format' found the drives, ZFS was easily able to import the pool without any fuss or issues. Performance was miserable when running off the add-in IDE controller, but great in the other two cases. As far as I can see, this stuff generally just works.

Rob T
Re: [zfs-discuss] ZFS raid on removable media for backups/temporary use possible?
There's a video put out by some Sun people in Germany (IIRC); they made several 4-device RAID-Zs on 3 USB hubs using a total of 12 USB thumbdrives. At one point they pulled all the USB sticks, shuffled them and then re-imported the pool. Worked like butter.

Corey

On May 7, 2007, at 1:30 PM, Tom Buskey wrote:

> I've been using long SATA cables routed out through the case to a home built chassis with its own power supply for a year now. Not even eSATA. That part works well. Substitute this for USB/Firewire/SCSI/USB thumb drives. It's really the same problem. Ok, now you want to deal with a ZFS zpool raid on multiple(?) removable drives. How well does ZFS work on removable media? In a RAID configuration? Are there issues with matching device names to disks?
[zfs-discuss] Boot disk clone with zpool present
I'm hoping that this is simpler than I think it is. :-)

We routinely clone our boot disks using a fairly simple script that:

1) Copies the source disk's partition layout to the target disk using prtvtoc, fmthard and installboot.
2) Using a list, runs newfs against the target slice and a ufsdump of the source slice piped to a ufsrestore of the target slice.

The result is a bootable clone of the source disk. Granted, there are vulnerabilities with using ufsdump on a mounted file system, but it works for us.

We're now looking at using ZFS file systems for /usr, /var, /opt, /export/home, etc., leaving the root file system (/) as UFS and swap as a bare slice as it is now.

I've successfully created an Alternate Root Pool and have replicated the ZFS file systems from another source zpool into the Alternate Root Pool using zfs send and zfs receive. Right now, I'm doing this without the benefit of a bootable system to play with. I'm experimenting with just ordinary file systems, not /usr, /opt, etc.

Now comes the chicken-and-egg part. I think I would have to fix up the mount points of the newly copied ZFS file systems on the Alternate Root Pool so that they remain set to /usr, /opt, etc. By the way, would these file systems have to be legacy mount points? It seems like they would have to be.

Here's the part that makes my head hurt: If I've created this Alternate Root Pool on a separate disk slice and populated it and exported it, and I've replicated a UFS root (/) file system on that same disk but in slice 0, how does that zpool get connected when I try to boot that cloned disk?

Fundamentally, the question is, how does one replicate a boot/system disk that contains zpool(s) for file systems other than the root file system? This is fairly straightforward with UFS file system technology. The addition of zpool identity seems to complicate the issue considerably.

Thank you very much for any advice or clarification.
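For reference, a stripped-down sketch of the kind of UFS clone script described in steps 1 and 2 above (the device names c0t0d0 / c1t0d0 and the mount point /clone are invented; the real script would loop over a slice list):

  # copy the partition table and make the target bootable (SPARC shown)
  prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
  installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c1t0d0s0

  # for each slice in the list: newfs, mount, then ufsdump | ufsrestore
  newfs /dev/rdsk/c1t0d0s0
  mount /dev/dsk/c1t0d0s0 /clone
  ufsdump 0f - /dev/rdsk/c0t0d0s0 | (cd /clone && ufsrestore rf -)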
Re: [zfs-discuss] ZFS Support for remote mirroring
Aaron Newcomb wrote:

> Does ZFS support any type of remote mirroring? It seems at present my only two options to achieve this would be Sun Cluster or Availability Suite. I thought that this functionality was in the works, but I haven't heard anything lately.

You could put something together using iSCSI, or zfs send/recv.

--matt
Re: [zfs-discuss] ZFS vs UFS2 overhead and may be a bug?
Pawel Jakub Dawidek wrote:

> This is what I see on Solaris (hole is 4GB):
>
> # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=128k
> real       23.7
> # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=128k
> real       21.2
> # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=4k
> real       31.4
> # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=4k
> real     7:32.2

This is probably because the time to execute this on ZFS is dominated by per-systemcall costs, rather than per-byte costs. You are doing 32x more system calls with the 4k blocksize, and it is taking 20x longer. That said, I could be wrong, and yowtch, that's much slower than I'd like!

--matt
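One way to check the "dominated by per-syscall cost" theory (a generic sketch, not something from the thread):

  # per-syscall counts and CPU time for the slow case
  truss -c dd if=/zfs/hole of=/dev/null bs=4k

  # or count read(2) calls while the test runs
  dtrace -n 'syscall::read:entry /execname == "dd"/ { @ = count(); }'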
Re: [zfs-discuss] Motley group of discs? (doing it right, or right now)
On 7-May-07, at 5:27 PM, Andy Lubel wrote:

> I think it will be in the next.next (10.6) OSX,

<baselessSpeculation> Well, the iPhone forced a few months schedule slip, perhaps *instead of* dropping features? </baselessSpeculation>

Mind you, I wouldn't be particularly surprised if ZFS wasn't in 10.5. Just so long as we get it eventually :-)

***suppresses giggle at MS whose schedule slipped years AND dropped any interesting features***
Re: [zfs-discuss] Boot disk clone with zpool present
Mark V. Dalton wrote:

> I'm hoping that this is simpler than I think it is. :-) We routinely clone our boot disks using a fairly simple script that:
> 1) Copies the source disk's partition layout to the target disk using prtvtoc, fmthard and installboot.

Danger Will Robinson! Disks can and do have different sizes, even disks with the same (Sun) part number. This causes difficulties or inefficiencies when you blindly copy the partition table like this. You will be better off using a script which creates the new partition map based upon the actual geometry and your desired configuration. This really isn't hard, but it does become site specific. Hint: use bc.

> 2) Using a list, runs newfs against the target slice and a ufsdump of the source slice piped to a ufsrestore of the target slice.

Yep, been doing that for decades. Actually, cpio is generally easier.

> The result is a bootable clone of the source disk. Granted, there are vulnerabilities with using ufsdump on a mounted file system but it works for us.

Actually, cpio is generally easier.

> We're now looking at using ZFS file systems for /usr, /var, /opt, /export/home, etc., leaving the root file system (/) as UFS and swap as a bare slice as it is now.

Actually, cpio is generally easier.

> I've successfully created an Alternate Root Pool and have replicated the ZFS file systems from another source zpool into the Alternate Root Pool using zfs send and zfs receive. Right now, I'm doing this without the benefit of a bootable system to play with. I'm experimenting with just ordinary file systems, not /usr, /opt, etc.
>
> Now comes the chicken-and-egg part. I think I would have to fix up the mount points of the newly copied ZFS file systems on the Alternate Root Pool so that they remain set to /usr, /opt, etc. By the way, would these file systems have to be legacy mount points? It seems like they would have to be.
>
> Here's the part that makes my head hurt: If I've created this Alternate Root Pool on a separate disk slice and populated it and exported it, and I've replicated a UFS root (/) file system on that same disk but in slice 0, how does that zpool get connected when I try to boot that cloned disk?
>
> Fundamentally, the question is, how does one replicate a boot/system disk that contains zpool(s) for file systems other than the root file system? This is fairly straightforward with UFS file system technology. The addition of zpool identity seems to complicate the issue considerably.

IMHO, it is more straightforward with ZFS, but I'm biased :-). For information see the ZFS boot pages:

http://www.opensolaris.org/os/community/zfs/boot/

How far you can go with this today depends on whether you're using SPARC or x86.

-- richard
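The "cpio is generally easier" remark refers to the usual trick of copying a mounted UFS filesystem with find | cpio instead of ufsdump; a generic sketch (the /clone mount point is invented):

  # copy the root filesystem onto an already newfs'ed and mounted target slice
  cd / && find . -xdev -depth -print | cpio -pdm /clone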
Re: [zfs-discuss] ZFS vs UFS2 overhead and may be a bug?
> Pawel Jakub Dawidek wrote:
>> This is what I see on Solaris (hole is 4GB):
>>
>> # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=128k
>> real       23.7
>> # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=128k
>> real       21.2
>> # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=4k
>> real       31.4
>> # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=4k
>> real     7:32.2
>
> This is probably because the time to execute this on ZFS is dominated by per-systemcall costs, rather than per-byte costs. You are doing 32x more system calls with the 4k blocksize, and it is taking 20x longer. That said, I could be wrong, and yowtch, that's much slower than I'd like!

You missed my earlier post where I showed accessing a hole file takes much longer than accessing a regular data file for blocksize of 4k and below. I will repeat the most dramatic difference:

                 ZFS                    UFS2
             Elapsed   System       Elapsed   System
md5 SPACY     210.01    77.46        337.51    25.54
md5 HOLEY     856.39   801.21         82.11    28.31

I used md5 because all but a couple of syscalls are for reading the file (with a buffer of 1K). dd would make an equal number of calls for writing. For both file systems and both cases the filesize is the same, but SPACY has 10GB allocated while HOLEY was created with "truncate -s 10G HOLEY". Look at the system times. On UFS2 system time is a little bit more for the HOLEY case because it has to clear a block. On ZFS it is over 10 times more! Something is very wrong.
Re: [zfs-discuss] ZFS Support for remote mirroring
ZFS send/receive?? I am not familiar with this feature. Is there a doc I can reference?

Thanks,
Aaron Newcomb
Sr. Systems Engineer
Sun Microsystems
[EMAIL PROTECTED]
Cell: 513-238-9511
Office: 513-562-4409

Matthew Ahrens wrote:

> Aaron Newcomb wrote:
>> Does ZFS support any type of remote mirroring? It seems at present my only two options to achieve this would be Sun Cluster or Availability Suite. I thought that this functionality was in the works, but I haven't heard anything lately.
> You could put something together using iSCSI, or zfs send/recv.
> --matt
Re: [zfs-discuss] ZFS Support for remote mirroring
Matthew Ahrens wrote:

> Aaron Newcomb wrote:
>> Does ZFS support any type of remote mirroring? It seems at present my only two options to achieve this would be Sun Cluster or Availability Suite. I thought that this functionality was in the works, but I haven't heard anything lately.
> You could put something together using iSCSI, or zfs send/recv.

I think the definition of "remote mirror" is up for grabs here, but in my mind remote mirror means the remote node has an always up-to-date copy of the primary data set, modulo any transactions in flight. AVS, aka remote mirror, aka sndr, is usually used for this kind of work on the host. Storage arrays have things like, ahem, remote mirror, truecopy, srdf, etc.
[zfs-discuss] Re: ZFS Support for remote mirroring
I guess when we are defining a mirror, are you talking about a synchronous mirror or an asynchronous mirror?

As stated earlier, if you are looking for an asynchronous mirror and do not want to use AVS, you can use zfs send and receive and craft a fairly simple script that runs constantly and updates a remote filesystem. zfs send takes a snapshot and turns it into a datastream to standard out, while zfs receive takes a stdin datastream and outputs it to a zfs filesystem. The zfs send and receive structures are only limited to your creativity. One example use might be the following:

zfs send pool/fs1@snap1 | ssh remote_hostname zfs receive remotepool/fs2

That would get you your initial copy, then you would have to take a snap and do incrementals from there on in with something like

zfs send -i pool/fs1@snap1 pool/fs1@snap2 | ssh remote_hostname zfs receive remotepool/fs2

Note that the filesystem at the other end (fs2 in this case) will be a live filesystem that you can use anytime. Now with that incremental commandline, you might run into a bug that is well known and you can find a workaround in these forums, so I won't get into it, but your script would have to incorporate the workaround, which would basically run a "zfs rollback" command on the remote host before you propagate the incremental changes.

~Bryan
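A hedged sketch of the rollback workaround mentioned at the end, using the same illustrative names as above; the receiving side is rolled back to the last common snapshot before each incremental is applied:

  # on (or via ssh to) the remote host, discard any local changes since the last received snapshot
  ssh remote_hostname zfs rollback remotepool/fs2@snap1
  # then send the next increment
  zfs send -i pool/fs1@snap1 pool/fs1@snap2 | ssh remote_hostname zfs receive remotepool/fs2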
[zfs-discuss] Re: Motley group of discs?
Well, since we are talking about home use: I never tried it as a spare, but if you want to get real nutty, do the setup cindys suggested but format the 600GB drive as UFS or some other filesystem and then try to create a 250GB file device as a spare on that UFS drive. It will give you redundancy and not waste all the space on the 600GB drive. ZFS allows the use of file devices instead of hardware devices, e.g.

zpool create test /tmp/testfiledevice

If you do it, let us know how it goes :)
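Spelled out a bit further, and purely as an untested sketch with invented paths (note also the caveat about file-backed vdevs in the next reply):

  # on the UFS-formatted 600GB drive, reserve a 250GB file...
  mkfile 250000m /bigdisk/sparefile
  # ...and offer it to the pool as a hot spare
  zpool add pool spare /bigdisk/sparefile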
Re: [zfs-discuss] Re: Motley group of discs?
Bryan Wagoner wrote:

> Well, since we are talking about home use: I never tried it as a spare, but if you want to get real nutty, do the setup cindys suggested but format the 600GB drive as UFS or some other filesystem and then try to create a 250GB file device as a spare on that UFS drive. It will give you redundancy and not waste all the space on the 600GB drive. ZFS allows the use of file devices instead of hardware devices, e.g.
>
> zpool create test /tmp/testfiledevice

However, I do not believe it is safe to use files under UFS as ZFS vdevs. ZFS expects data to be flushed and, IIRC, UFS does not guarantee that for regular files. Search the archives for more info.

That said, you can certainly divide the 600 GByte disk into 3 slices. Later, you can always replace a slice with a different, bigger slice to grow.

-- richard
[zfs-discuss] Benchmark which models ISP workloads
This benchmark models real-world workload faced by many ISPs worldwide every day:

http://untroubled.org/benchmarking/2004-04/

Would appreciate if the ZFS team or the Performance group could take a look at it. I've run this myself on b61 (minor mods to the driver program), but obviously Team ZFS or the performance team may be interested in comparing results with different operating systems.
[zfs-discuss] Re: Samba and ZFS ACL Question
> Have there been any new developments regarding the availability of vfs_zfsacl.c? Jeb, were you able to get a copy of Jiri's work-in-progress? I need this ASAP (as I'm sure most everyone watching this thread does)...

me too... A.S.A.P.!!!

-- leon