Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Fajar A. Nugraha
On Tue, Dec 20, 2011 at 9:51 AM, Frank Cusack  wrote:
> If you don't detach the smaller drive, the pool size won't increase.  Even
> if the remaining smaller drive fails, that doesn't mean you have to detach
> it.  So yes, the pool size might increase, but it won't be "unexpectedly".
> It will be because you detached all smaller drives.  Also, even if a smaller
> drive is failed, it can still be attached.

Isn't autoexpand=off by default, so it won't use the larger size anyway?
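
For reference, checking or flipping it is just the following (a quick sketch,
assuming the pool is called rpool):

	zpool get autoexpand rpool       # should report "off" unless someone changed it
	zpool set autoexpand=on rpool    # only if the larger size is actually wanted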

-- 
Fajar
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Frank Cusack
If you don't detach the smaller drive, the pool size won't increase.  Even
if the remaining smaller drive fails, that doesn't mean you have to detach
it.  So yes, the pool size might increase, but it won't be "unexpectedly".
It will be because you detached all smaller drives.  Also, even if a
smaller drive is failed, it can still be attached.

It doesn't make sense for attach to do anything with partition tables, IMHO.

I *always* order the spare when I order the original drives, to have it on
hand, even for my home system.  Drive sizes change more frequently than
they fail, for me.  Sure, when I use the spare I may not be able to order a
new spare of the same size, but at least at that time I have time to
prepare and am not scrambling.

On Mon, Dec 19, 2011 at 3:55 PM, Gregg Wonderly  wrote:

>  That's why I'm asking.  I think it should always mirror the partition
> table and allocate exactly the same amount of space so that the pool
> doesn't suddenly change sizes unexpectedly and require a disk size that I
> don't have at hand, to put the mirror back up.
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!

2011-12-19 Thread Richard Elling
comments below…

On Dec 18, 2011, at 6:53 AM, Jan-Aage Frydenbø-Bruvoll wrote:

> Dear List,
> 
> I have a storage server running OpenIndiana with a number of storage
> pools on it. All the pools' disks come off the same controller, and
> all pools are backed by SSD-based l2arc and ZIL. Performance is
> excellent on all pools but one, and I am struggling greatly to figure
> out what is wrong.
> 
> A very basic test shows the following - pretty much typical
> performance at the moment:
> 
> root@stor:/# for a in pool1 pool2 pool3; do dd if=/dev/zero of=$a/file
> bs=1M count=10; done
> 10+0 records in
> 10+0 records out
> 10485760 bytes (10 MB) copied, 0.00772965 s, 1.4 GB/s
> 10+0 records in
> 10+0 records out
> 10485760 bytes (10 MB) copied, 0.00996472 s, 1.1 GB/s
> 10+0 records in
> 10+0 records out
> 10485760 bytes (10 MB) copied, 71.8995 s, 146 kB/s

Enable compression and they should all go fast :-)
But seriously, you could be getting tripped up by the allocator. There are
several different allocation algorithms, and they all begin to thrash at high
utilization; some handle particular cases better than others. One
troubleshooting tip would be to observe the utilization of the metaslabs:
zdb -m pool3

If there are metaslabs that are > 96% full, then look more closely at the 
allocator algorithms.
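
A rough way to eyeball that (the exact column layout of zdb -m varies between
builds, so treat this as a sketch and adjust the pattern to taste):

	zdb -m pool3 | egrep 'vdev|metaslab|free' | less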

> 
> The zpool status of the affected pool is:
> 
> root@stor:/# zpool status pool3
>  pool: pool3
> state: ONLINE
> scan: resilvered 222G in 24h2m with 0 errors on Wed Dec 14 15:20:11 2011
> config:
> 
>NAME  STATE READ WRITE CKSUM
>pool3 ONLINE   0 0 0
>  c1t0d0  ONLINE   0 0 0
>  c1t1d0  ONLINE   0 0 0
>  c1t2d0  ONLINE   0 0 0
>  c1t3d0  ONLINE   0 0 0
>  c1t4d0  ONLINE   0 0 0
>  c1t5d0  ONLINE   0 0 0
>  c1t6d0  ONLINE   0 0 0
>  c1t7d0  ONLINE   0 0 0
>  c1t8d0  ONLINE   0 0 0
>  c1t9d0  ONLINE   0 0 0
>  c1t10d0 ONLINE   0 0 0
>  mirror-12   ONLINE   0 0 0
>c1t26d0   ONLINE   0 0 0
>c1t27d0   ONLINE   0 0 0
>  mirror-13   ONLINE   0 0 0
>c1t28d0   ONLINE   0 0 0
>c1t29d0   ONLINE   0 0 0
>  mirror-14   ONLINE   0 0 0
>c1t34d0   ONLINE   0 0 0
>c1t35d0   ONLINE   0 0 0
>logs
>  mirror-11   ONLINE   0 0 0
>c2t2d0p8  ONLINE   0 0 0
>c2t3d0p8  ONLINE   0 0 0
>cache
>  c2t2d0p12   ONLINE   0 0 0
>  c2t3d0p12   ONLINE   0 0 0
> 
> errors: No known data errors
> 
> Ditto for the disk controller - MegaCli reports zero errors, be that
> on the controller itself, on this pool's disks or on any of the other
> attached disks.
> 
> I am pretty sure I am dealing with a disk-based problem here, i.e. a
> flaky disk that is "just" slow without exhibiting any actual data
> errors, holding the rest of the pool back, but I am at a loss as to how
> to pinpoint what is going on.

"iostat -x" shows the average service time of each disk. If one disk or 
set of disks is a lot slower, when also busy, then it should be clearly visible
in the iostat output. Personally, I often use something like "iostat -zxCn 10"
for 10-second samples.
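
The cumulative per-device error counters can also point at a flaky disk that
never surfaces a ZFS-visible error - a quick sketch:

	iostat -En | grep Errors    # soft/hard/transport error counts per device
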
 -- richard

> 
> Would anybody on the list be able to give me any pointers as how to
> dig up more detailed information about the pool's/hardware's
> performance?
> 
> Thank you in advance for your kind assistance.
> 
> Best regards
> Jan
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 

ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!

2011-12-19 Thread Gregg Wonderly

On 12/18/2011 4:23 PM, Jan-Aage Frydenbø-Bruvoll wrote:

Hi,

On Sun, Dec 18, 2011 at 22:14, Nathan Kroenert  wrote:

  I know some others may already have pointed this out - but I can't see it
and not say something...

Do you realise that losing a single disk in that pool could pretty much
render the whole thing busted?

At least for me - the rate at which _I_ seem to lose disks, it would be
worth considering something different ;)

Yeah, I have thought that thought myself. I am pretty sure I have a
broken disk, however I cannot for the life of me find out which one.
zpool status gives me nothing to work on, MegaCli reports that all
virtual and physical drives are fine, and iostat gives me nothing
either.

What other tools are there out there that could help me pinpoint
what's going on?



One choice would be to take a single drive that you believe is in good working
condition, and attach it as a mirror to each single-disk vdev in turn.  If there
is a bad disk, you will find out when the resilver fails because of a read
error.  A scrub, though, should really tell you everything you need to know
about failing disks, once the surface becomes corrupted enough that it can't be
corrected by re-reading enough times.
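
For example, something along these lines - the spare's device name below is a
placeholder, so substitute your known-good drive and one of the suspect disks:

	zpool attach pool3 c1t0d0 c1t36d0   # the resilver reads every allocated block on c1t0d0
	zpool status -v pool3               # watch for read/cksum errors while it resilvers
	zpool detach pool3 c1t36d0          # then move the spare on to the next suspect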


It looks like you've started mirroring some of the drives.  That's really what 
you should be doing for the other non-mirror drives.


Gregg Wonderly
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Gregg Wonderly
That's why I'm asking.  I think it should always mirror the partition table and 
allocate exactly the same amount of space so that the pool doesn't suddenly 
change sizes unexpectedly and require a disk size that I don't have at hand, to 
put the mirror back up.


Gregg

On 12/18/2011 4:08 PM, Nathan Kroenert wrote:
Do note that, though Frank is correct, you have to be a little careful about
what might happen should you drop your original disk and only the large
mirror half is left... ;)


On 12/16/11 07:09 PM, Frank Cusack wrote:
You can just do fdisk to create a single large partition.  The attached 
mirror doesn't have to be the same size as the first component.


On Thu, Dec 15, 2011 at 11:27 PM, Gregg Wonderly wrote:


Cindy, will it ever be possible to just have attach mirror the surfaces,
including the partition tables?  I spent an hour today trying to get a
new mirror on my root pool.  There was a 250GB disk that failed.  I only
had a 1.5TB handy as a replacement.  prtvtoc ... | fmthard does not work
in this case and so you have to do the partitioning by hand, which is
just silly to fight with anyway.

Gregg

Sent from my iPhone

On Dec 15, 2011, at 6:13 PM, Tim Cook <t...@cook.ms> wrote:


Do you still need to do the grub install?

On Dec 15, 2011 5:40 PM, "Cindy Swearingen" <cindy.swearin...@oracle.com> wrote:

Hi Anon,

The disk that you attach to the root pool will need an SMI label
and a slice 0.

The syntax to attach a disk to create a mirrored root pool
is like this, for example:

# zpool attach rpool c1t0d0s0 c1t1d0s0

Thanks,

Cindy

On 12/15/11 16:20, Anonymous Remailer (austria) wrote:


On Solaris 10 If I install using ZFS root on only one drive is
there a way
to add another drive as a mirror later? Sorry if this was discussed
already. I searched the archives and couldn't find the answer.
Thank you.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Peter Jeremy
On 2011-Dec-20 00:29:50 +1100, Jim Klimov  wrote:
>2011-12-19 16:58, Pawel Jakub Dawidek wrote:
>> On Mon, Dec 19, 2011 at 10:18:05AM +, Darren J Moffat wrote:
>>> For those of us not familiar with how FreeBSD is installed and boots can
>>> you explain how boot works (ie do you use GRUB at all and if so which
>>> version and where the early boot ZFS code is).
>>
>> We don't use GRUB, no. We use three stages for booting. Stage 0 is
>> basically a 512-byte, very simple MBR boot loader installed at the
>> beginning of the disk that is used to launch the stage 1 boot loader. Stage 1
>> is where we interpret the ZFS (or UFS) on-disk structures and read real files.
...
>Hmm... and is the freebsd-boot partition redundant somehow?

In the GPT case, each boot device would have a copy of both the boot0
MBR and a freebsd-boot partition containing gptzfsboot.  Both zfsboot
(used with traditional MBR/fdisk partitioning) and gptzfsboot
incorporate standard ZFS code and so should be able to boot off any
supported zpool type (but note that there's a bug in the handling of
gang blocks that was only fixed very recently).

>Is it mirrored or can be striped over several disks?

Effectively the boot code is mirrored on each bootdisk.  FreeBSD does
not have the same partitioned vs whole disk issues as Solaris so there
is no downside to using partitioned disks with ZFS on FreeBSD.

>I was educated that the core problem lies in the system's
>required ability to boot off any single device (including
>volumes of several disks singularly presented by HWRAIDs).
>This "BIOS boot device" should hold everything that is
>required and sufficient to go on booting the OS and using
>disk sets of some more sophisticated redundancy.

Normally, firmware boot code (BIOS, EFI, OFW etc) has no RAID ability
and needs to load bootstrap code off a single (physical or HW RAID)
boot device.  The exception is the primitive software RAID solutions
found in consumer PC hardware - which are best ignored.

Effectively, all the code needed prior to the point where a software
RAID device can be built must be replicated in full across all boot
devices.  For RAID-1, everything is already replicated so it's
sufficient to just treat one mirror as the boot device and let the
kernel build the RAID device.  For anything more complex, one of the
bootstrap stages has to build enough of the RAID device to allow the
kernel (etc) to be read out of the RAID device.

>I gather that in FreeBSD's case this "self-sufficient"
>bootloader is small and incurs a small storage overhead,
>even if cloned to a dozen disks in your array?

gptzfsboot is currently ~34KB (20KB larger than the equivalent UFS
bootstrap).  GPT has a 34-sector overhead and the freebsd-boot
partition is typically 128 sectors to allow for future growth (though
I've shrunk it at home to 94 sectors so the following partition is on
a 64KB boundary to better suit future 4KB disks).  My mirrored ZFS
system at work is partitioned as:
$ gpart show -p
=>        34  78124933  ad0    GPT           (37G)
          34       128  ad0p1  freebsd-boot  (64k)
         162   5242880  ad0p2  freebsd-swap  (2.5G)
     5243042  72881925  ad0p3  freebsd-zfs   (34G)

=>        34  78124933  ad1    GPT           (37G)
          34       128  ad1p1  freebsd-boot  (64k)
         162   5242880  ad1p2  freebsd-swap  (2.5G)
     5243042  72881925  ad1p3  freebsd-zfs   (34G)
(The first 2 columns are absolute offset and size in sectors)
My root pool is a mirror of ad0p3 and ad1p3.
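
For anyone wanting to reproduce a layout like that, it is roughly the following
per disk (a sketch from memory rather than a transcript - check the sizes and
device names against your system):

	gpart create -s gpt ad0
	gpart add -t freebsd-boot -s 128 ad0
	gpart add -t freebsd-swap -s 5242880 ad0
	gpart add -t freebsd-zfs ad0
	gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ad0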

>In this case Solaris's problem with only-mirrored ZFS
>on root pools is that the "self-sufficient" quantum
>of required data is much larger; but otherwise the
>situation is the same?

If you have enough data and disk space, the overheads in combining a
mirrored root with RAIDZ data aren't that great.  At home, I have 6
1TB disks and I've carved out 8GB from the front of each (3GB for swap
and 5GB for root) and the remainder in a RAIDZ2 pool - that's less
than 1% overhead.  5GB is big enough to hold the complete source tree
and compile it, as well as the base OS.  I have a 3-way mirrored root
across half the disks and use the other "root" partitions as
"temporary" roots when upgrading.

-- 
Peter Jeremy


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Cindy Swearingen

Hi Andrew,

Current releases that apply the bootblocks automatically during
a zpool attach operation are Oracle Solaris 10 8/11 and Oracle
Solaris 11.
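
For reference, on releases older than those the manual step after the attach
was the usual boot-block install (from memory - please double-check the paths
on your release):

	# x86
	installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0
	# SPARC
	installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c1t1d0s0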

Thanks,

Cindy



On 12/19/11 10:03, Daugherity, Andrew W wrote:

Does "current" include sol10u10 as well as sol11? If so, when did that
go in? Was it in sol10u9?


Thanks,

Andrew


*From: *Cindy Swearingen <cindy.swearin...@oracle.com>
*Subject: *Re: [zfs-discuss] Can I create a mirror for a root rpool?
*Date: *December 16, 2011 10:38:21 AM CST
*To: *Tim Cook <t...@cook.ms>
*Cc: *<zfs-discuss@opensolaris.org>


Hi Tim,

No, in current Solaris releases the boot blocks are installed
automatically with a zpool attach operation on a root pool.

Thanks,

Cindy

On 12/15/11 17:13, Tim Cook wrote:

Do you still need to do the grub install?

On Dec 15, 2011 5:40 PM, "Cindy Swearingen" <cindy.swearin...@oracle.com> wrote:

Hi Anon,

The disk that you attach to the root pool will need an SMI label
and a slice 0.

The syntax to attach a disk to create a mirrored root pool
is like this, for example:

# zpool attach rpool c1t0d0s0 c1t1d0s0

Thanks,

Cindy


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Daugherity, Andrew W
Does "current" include sol10u10 as well as sol11?  If so, when did that go in?  
Was it in sol10u9?


Thanks,

Andrew

From: Cindy Swearingen <cindy.swearin...@oracle.com>
Subject: Re: [zfs-discuss] Can I create a mirror for a root rpool?
Date: December 16, 2011 10:38:21 AM CST
To: Tim Cook <t...@cook.ms>
Cc: <zfs-discuss@opensolaris.org>


Hi Tim,

No, in current Solaris releases the boot blocks are installed
automatically with a zpool attach operation on a root pool.

Thanks,

Cindy

On 12/15/11 17:13, Tim Cook wrote:
Do you still need to do the grub install?

On Dec 15, 2011 5:40 PM, "Cindy Swearingen" <cindy.swearin...@oracle.com> wrote:

   Hi Anon,

   The disk that you attach to the root pool will need an SMI label
   and a slice 0.

   The syntax to attach a disk to create a mirrored root pool
   is like this, for example:

   # zpool attach rpool c1t0d0s0 c1t1d0s0

   Thanks,

   Cindy

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Cindy Swearingen

Hi Pawel,

In addition to the current SMI label requirement for booting,
I believe another limitation is that the boot info must be
contiguous.

I think an RFE is filed to relax this requirement as well.
I just can't find it right now.

Thanks,

Cindy

On 12/18/11 04:52, Pawel Jakub Dawidek wrote:

On Thu, Dec 15, 2011 at 04:39:07PM -0700, Cindy Swearingen wrote:

Hi Anon,

The disk that you attach to the root pool will need an SMI label
and a slice 0.

The syntax to attach a disk to create a mirrored root pool
is like this, for example:

# zpool attach rpool c1t0d0s0 c1t1d0s0


BTW. Can you, Cindy, or someone else reveal why one cannot boot from
RAIDZ on Solaris? Is this because Solaris is using GRUB and RAIDZ code
would have to be licensed under GPL as the rest of the boot code?

I'm asking, because I see no technical problems with this functionality.
Booting off of RAIDZ (even RAIDZ3) and also from multi-top-level-vdev
pools works just fine on FreeBSD for a long time now. Not being forced
to have a dedicated pool just for the root if you happen to have more than
two disks in your box is very convenient.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SATA hardware advice

2011-12-19 Thread Garrett D'Amore

On Dec 19, 2011, at 7:52 AM, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. wrote:

> AFAIK, most ZFS-based storage appliances have moved to SAS drives at 7200 rpm
> or 15k rpm.
> Most SSDs are SATA and connect to the on-board SATA I/O chips.

Most *cheap* SSDs are SATA.  But if you want to use them in a cluster 
configuration, you need to use a SAS device that supports multiple initiators, 
such as those from STEC.

- Garrett
> 
> 
> On 12/19/2011 9:59 AM, tono wrote:
>> Thanks for the sugestions, especially all the HP info and build
>> pictures.
>> 
>> Two things crossed my mind on the hardware front. The first is regarding
>> the SSDs you have pictured, mounted in sleds. Any Proliant that I've
>> read about connects the hotswap drives via a SAS backplane. So how did
>> you avoid that (physically) to make the direct SATA connections?
>> 
>> The second is regarding a conversation I had with HP pre-sales. A rep
>> actually told me, in no uncertain terms, that using non-HP HBAs, RAM, or
>> drives would completely void my warranty. I assume this is BS but I
>> wonder if anyone has ever gotten resistance due to 3rd party hardware.
>> In the States, at least, there is the Magnuson–Moss act. I'm just not
>> sure if it applies to servers.
>> 
>> Back to SATA though. I can appreciate fully about not wanting to take
>> unnecessary risks, but there are a few things that don't sit well with
>> me.
>> 
>> A little background: this is to be a backup server for a small/medium
>> business. The data, of course, needs to be safe, but we don't need
>> extreme HA.
>> 
>> I'm aware of two specific issues with SATA drives: the TLER/CCTL
>> setting, and the issue with SAS expanders. I have to wonder if these
>> account for most of the bad rap that SATA drives get. Expanders are
>> built into nearly all of the JBODs and storage servers I've found
>> (including the one in the serverfault post), so they must be in common
>> use.
>> 
>> So I'll ask again: are there any issues when connecting SATA drives
>> directly to a HBA? People are, after all, talking left and right about
>> using SATA SSDs... as long as they are connected directly to the MB
>> controller.
>> 
>> We might just do SAS at this point for peace of mind. It just bugs me
>> that you can't use "inexpensive disks" in a R.A.I.D. I would think that
>> RAIDZ and AHCI could handle just about any failure mode by now.
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> 
> -- 
> Hung-Sheng Tsao Ph D.
> Founder&  Principal
> HopBit GridComputing LLC
> cell: 9734950840
> http://laotsao.wordpress.com/
> http://laotsao.blogspot.com/
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SATA hardware advice

2011-12-19 Thread Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.
AFAIK, most ZFS-based storage appliances have moved to SAS drives at 7200 rpm
or 15k rpm.

Most SSDs are SATA and connect to the on-board SATA I/O chips.


On 12/19/2011 9:59 AM, tono wrote:

Thanks for the suggestions, especially all the HP info and build
pictures.

Two things crossed my mind on the hardware front. The first is regarding
the SSDs you have pictured, mounted in sleds. Any Proliant that I've
read about connects the hotswap drives via a SAS backplane. So how did
you avoid that (physically) to make the direct SATA connections?

The second is regarding a conversation I had with HP pre-sales. A rep
actually told me, in no uncertain terms, that using non-HP HBAs, RAM, or
drives would completely void my warranty. I assume this is BS but I
wonder if anyone has ever gotten resistance due to 3rd party hardware.
In the States, at least, there is the Magnuson–Moss act. I'm just not
sure if it applies to servers.

Back to SATA though. I can appreciate fully about not wanting to take
unnecessary risks, but there are a few things that don't sit well with
me.

A little background: this is to be a backup server for a small/medium
business. The data, of course, needs to be safe, but we don't need
extreme HA.

I'm aware of two specific issues with SATA drives: the TLER/CCTL
setting, and the issue with SAS expanders. I have to wonder if these
account for most of the bad rap that SATA drives get. Expanders are
built into nearly all of the JBODs and storage servers I've found
(including the one in the serverfault post), so they must be in common
use.

So I'll ask again: are there any issues when connecting SATA drives
directly to a HBA? People are, after all, talking left and right about
using SATA SSDs... as long as they are connected directly to the MB
controller.

We might just do SAS at this point for peace of mind. It just bugs me
that you can't use "inexpensive disks" in a R.A.I.D. I would think that
RAIDZ and AHCI could handle just about any failure mode by now.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Hung-Sheng Tsao Ph D.
Founder&  Principal
HopBit GridComputing LLC
cell: 9734950840
http://laotsao.wordpress.com/
http://laotsao.blogspot.com/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SATA hardware advice

2011-12-19 Thread tono
Thanks for the suggestions, especially all the HP info and build
pictures.

Two things crossed my mind on the hardware front. The first is regarding
the SSDs you have pictured, mounted in sleds. Any Proliant that I've
read about connects the hotswap drives via a SAS backplane. So how did
you avoid that (physically) to make the direct SATA connections?

The second is regarding a conversation I had with HP pre-sales. A rep
actually told me, in no uncertain terms, that using non-HP HBAs, RAM, or
drives would completely void my warranty. I assume this is BS but I
wonder if anyone has ever gotten resistance due to 3rd party hardware.
In the States, at least, there is the Magnuson–Moss act. I'm just not
sure if it applies to servers.

Back to SATA though. I can appreciate fully about not wanting to take
unnecessary risks, but there are a few things that don't sit well with
me.

A little background: this is to be a backup server for a small/medium
business. The data, of course, needs to be safe, but we don't need
extreme HA.

I'm aware of two specific issues with SATA drives: the TLER/CCTL
setting, and the issue with SAS expanders. I have to wonder if these
account for most of the bad rap that SATA drives get. Expanders are
built into nearly all of the JBODs and storage servers I've found
(including the one in the serverfault post), so they must be in common
use.

So I'll ask again: are there any issues when connecting SATA drives
directly to a HBA? People are, after all, talking left and right about
using SATA SSDs... as long as they are connected directly to the MB
controller.

We might just do SAS at this point for peace of mind. It just bugs me
that you can't use "inexpensive disks" in a R.A.I.D. I would think that
RAIDZ and AHCI could handle just about any failure mode by now.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Jim Klimov

2011-12-19 16:58, Pawel Jakub Dawidek wrote:

On Mon, Dec 19, 2011 at 10:18:05AM +, Darren J Moffat wrote:

For those of us not familiar with how FreeBSD is installed and boots can
you explain how boot works (ie do you use GRUB at all and if so which
version and where the early boot ZFS code is).


We don't use GRUB, no. We use three stages for booting. Stage 0 is
basically a 512-byte, very simple MBR boot loader installed at the
beginning of the disk that is used to launch the stage 1 boot loader. Stage 1
is where we interpret the ZFS (or UFS) on-disk structures and read real files.
When you use GPT, there is a dedicated partition (of type freebsd-boot)
where you install the gptzfsboot binary (stage 0 looks for a GPT partition of
type freebsd-boot, loads it and starts the code in there). This
partition doesn't contain a file system, of course; boot0 is too simple to
read any file system. gptzfsboot is where we handle all ZFS-related
operations; it is mostly used to find the root dataset and load
zfsloader from there. The zfsloader is the last stage in booting. It
shares the same ZFS-related code as gptzfsboot (but compiled into a
separate binary); it loads the modules and the kernel and starts it.
The zfsloader is stored in the /boot/ directory on the root dataset.


Hmm... and is the freebsd-boot partition redundant somehow?
Is it mirrored or can be striped over several disks?

I was educated that the core problem lies in the system's
required ability to boot off any single device (including
volumes of several disks singularly presented by HWRAIDs).
This "BIOS boot device" should hold everything that is
required and sufficient to go on booting the OS and using
disk sets of some more sophisticated redundancy.

I gather that in FreeBSD's case this "self-sufficient"
bootloader is small and incurs a small storage overhead,
even if cloned to a dozen disks in your array?

In this case Solaris's problem with only-mirrored ZFS
on root pools is that the "self-sufficient" quantum
of required data is much larger; but otherwise the
situation is the same?

Thanks for clarifying,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!

2011-12-19 Thread Hung-Sheng Tsao (laoTsao)
Not sure OI supports shadow migration.
Or you may be able to send the zpool to another server, then send it back, to defragment it.
regards

Sent from my iPad

On Dec 19, 2011, at 8:15, Gary Mills  wrote:

> On Mon, Dec 19, 2011 at 11:58:57AM +, Jan-Aage Frydenbø-Bruvoll wrote:
>> 
>> 2011/12/19 Hung-Sheng Tsao (laoTsao) :
>>> did you run a scrub?
>> 
>> Yes, as part of the previous drive failure. Nothing reported there.
>> 
>> Now, interestingly - I deleted two of the oldest snapshots yesterday,
>> and guess what - the performance went back to normal for a while. Now
>> it is severely dropping again - after a good while on 1.5-2GB/s I am
>> again seeing write performance in the 1-10MB/s range.
> 
> That behavior is a symptom of fragmentation.  Writes slow down
> dramatically when there are no contiguous blocks available.  Deleting
> a snapshot provides some of these, but only temporarily.
> 
> -- 
> -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!

2011-12-19 Thread Jim Klimov

2011-12-19 2:53, Jan-Aage Frydenbø-Bruvoll wrote:

On Sun, Dec 18, 2011 at 22:14, Nathan Kroenert  wrote:

Do you realise that losing a single disk in that pool could pretty much
render the whole thing busted?


Ah - didn't pick up on that one until someone here pointed it out -
all my disks are mirrored, however some of them are mirrored on the
controller level.


The problem somewhat remains: it is unknown (to us and to ZFS)
*how* the disks are mirrored by hardware. For example, if a
single-sector error exists, would the controller detect it
quickly? Would it choose the good copy correctly or use the
"first disk" blindly, for the lack of other clues?

Many RAID controllers are relatively dumb in what they do,
and if an error does get detected, the whole problematic
disk is overwritten. This is long, error-prone (if the other
disk in the pair is also imperfect), and has a tendency to
ignore small errors - such as those detected by ZFS with
its per-block checksums.

So, in case of one HW disk having an error, you might be
having random data presented by the HW mirror. Since in
your case ZFS is only used to stripe over HW mirrors,
it has no redundancy to intelligently detect and fix
such "small errors". And depending on the error's
location in the block tree, the problem might range
from ignorable to fatal.

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!

2011-12-19 Thread Gary Mills
On Mon, Dec 19, 2011 at 11:58:57AM +, Jan-Aage Frydenbø-Bruvoll wrote:
> 
> 2011/12/19 Hung-Sheng Tsao (laoTsao) :
> > did you run a scrub?
> 
> Yes, as part of the previous drive failure. Nothing reported there.
> 
> Now, interestingly - I deleted two of the oldest snapshots yesterday,
> and guess what - the performance went back to normal for a while. Now
> it is severely dropping again - after a good while on 1.5-2GB/s I am
> again seeing write performance in the 1-10MB/s range.

That behavior is a symptom of fragmentation.  Writes slow down
dramatically when there are no contiguous blocks available.  Deleting
a snapshot provides some of these, but only temporarily.
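
A quick sanity check is simply how full the pool is - fragmentation bites much
earlier on a nearly-full pool (a sketch, adjust the pool name):

	zpool list pool3                    # look at the CAP column
	zfs list -o space -r pool3 | head   # see where the space (incl. snapshots) went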

-- 
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!

2011-12-19 Thread Jim Klimov

2011-12-19 2:00, Fajar A. Nugraha wrote:

 From http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
(or at least Google's cache of it, since it seems to be inaccessible
now:

"
Keep pool space under 80% utilization to maintain pool performance.
Currently, pool performance can degrade when a pool is very full and
file systems are updated frequently, such as on a busy mail server.
Full pools might cause a performance penalty, but no other issues. If
the primary workload is immutable files (write once, never remove),
then you can keep a pool in the 95-96% utilization range. Keep in mind
that even with mostly static content in the 95-96% range, write, read,
and resilvering performance might suffer.
"



This reminds me that I had a question :)

If I were to "reserve" space on a pool by creating a dataset
with a reservation totalling, say, 20% of all pool size -
but otherwise keep this dataset empty - would it help the
pool to maintain performance until the rest of the pool is
100% full (or the said 80% of total pool size)? Technically
the pool would always have large empty slabs, but would be
forbidden to write more than 80% of pool size...

Basically this should be equivalent to the "root-reserved 5%"
on traditional FSes like UFS, EXT3, etc. Would it be?
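
Something like the following is what I have in mind (pool and dataset names
are made up, just a sketch):

	# keep ~20% of the pool permanently free via an empty, unmounted dataset
	zfs create -o refreservation=200G -o mountpoint=none tank/spacer
	# hand the space back later if the pool really needs it
	zfs set refreservation=none tank/spacer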

Thanks,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] very slow write performance on 151a

2011-12-19 Thread Jim Klimov

2011-12-15 22:44, milosz wrote:

There are a few metaslab-related tunables that can be tweaked as well.
- Bill


For the sake of completeness, here are the relevant lines
I have in /etc/system:

**
* fix up metaslab min size (recent default ~10Mb seems bad,
* recommended return to 4Kb, we'll do 4*8K)
* greatly increases write speed in filled-up pools
set zfs:metaslab_min_alloc_size = 0x8000
set zfs:metaslab_smo_bonus_pct = 0xc8
**

These values were described in greater detail on the list
this summer, I think.
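
If anyone wants to inspect or try these on a live system before editing
/etc/system and rebooting, the usual mdb route should work (hedged - verify
that the symbol names exist on your build first):

	echo 'metaslab_min_alloc_size/J' | mdb -k          # print the current value
	echo 'metaslab_min_alloc_size/Z 0x8000' | mdb -kw  # poke in the new value live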

HTH,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Pawel Jakub Dawidek
On Mon, Dec 19, 2011 at 10:18:05AM +, Darren J Moffat wrote:
> On 12/18/11 11:52, Pawel Jakub Dawidek wrote:
> > On Thu, Dec 15, 2011 at 04:39:07PM -0700, Cindy Swearingen wrote:
> >> Hi Anon,
> >>
> >> The disk that you attach to the root pool will need an SMI label
> >> and a slice 0.
> >>
> >> The syntax to attach a disk to create a mirrored root pool
> >> is like this, for example:
> >>
> >> # zpool attach rpool c1t0d0s0 c1t1d0s0
> >
> > BTW. Can you, Cindy, or someone else reveal why one cannot boot from
> > RAIDZ on Solaris? Is this because Solaris is using GRUB and RAIDZ code
> > would have to be licensed under GPL as the rest of the boot code?
> >
> > I'm asking, because I see no technical problems with this functionality.
> > Booting off of RAIDZ (even RAIDZ3) and also from multi-top-level-vdev
> > pools works just fine on FreeBSD for a long time now. Not being forced
> > to have a dedicated pool just for the root if you happen to have more than
> > two disks in your box is very convenient.
> 
> For those of us not familiar with how FreeBSD is installed and boots can 
> you explain how boot works (ie do you use GRUB at all and if so which 
> version and where the early boot ZFS code is).

We don't use GRUB, no. We use three stages for booting. Stage 0 is
basically a 512-byte, very simple MBR boot loader installed at the
beginning of the disk that is used to launch the stage 1 boot loader. Stage 1
is where we interpret the ZFS (or UFS) on-disk structures and read real files.
When you use GPT, there is a dedicated partition (of type freebsd-boot)
where you install the gptzfsboot binary (stage 0 looks for a GPT partition of
type freebsd-boot, loads it and starts the code in there). This
partition doesn't contain a file system, of course; boot0 is too simple to
read any file system. gptzfsboot is where we handle all ZFS-related
operations; it is mostly used to find the root dataset and load
zfsloader from there. The zfsloader is the last stage in booting. It
shares the same ZFS-related code as gptzfsboot (but compiled into a
separate binary); it loads the modules and the kernel and starts it.
The zfsloader is stored in the /boot/ directory on the root dataset.
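
For the archives, putting stages 0 and 1 onto a GPT disk is typically a single
command per disk (the device name is just an example, and it assumes the
freebsd-boot partition is index 1):

	gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

Stage 2 then simply lives as /boot/zfsloader on the root dataset.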

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!

2011-12-19 Thread Jan-Aage Frydenbø-Bruvoll
Hi,

2011/12/19 Hung-Sheng Tsao (laoTsao) :
> what is the ram size?

32 GB

> are there many snap? create then delete?

Currently, there are 36 snapshots on the pool - it is part of a fairly
normal backup regime of snapshots every 5 min, hour, day, week and
month.

> did you run a scrub?

Yes, as part of the previous drive failure. Nothing reported there.

Now, interestingly - I deleted two of the oldest snapshots yesterday,
and guess what - the performance went back to normal for a while. Now
it is severely dropping again - after a good while on 1.5-2GB/s I am
again seeing write performance in the 1-10MB/s range.

Is there an upper limit on the number of snapshots on a ZFS pool?
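
For what it's worth, this is how I count and size them - trivial, but included
for completeness:

	zfs list -H -t snapshot -r pool3 | wc -l
	zfs list -t snapshot -o name,used -s used -r pool3 | tail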

Best regards
Jan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!

2011-12-19 Thread Hung-Sheng Tsao (laoTsao)
what is the ram size?
are there many snap? create then delete?
did you run a scrub?

Sent from my iPad

On Dec 18, 2011, at 10:46, Jan-Aage Frydenbø-Bruvoll  wrote:

> Hi,
> 
> On Sun, Dec 18, 2011 at 15:13, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
>  wrote:
>> what are the output of zpool status pool1 and pool2
>> it seems that you have mix configuration of pool3 with disk and mirror
> 
> The other two pools show very similar outputs:
> 
> root@stor:~# zpool status pool1
>  pool: pool1
> state: ONLINE
> scan: resilvered 1.41M in 0h0m with 0 errors on Sun Dec  4 17:42:35 2011
> config:
> 
>NAME  STATE READ WRITE CKSUM
>pool1  ONLINE   0 0 0
>  mirror-0ONLINE   0 0 0
>c1t12d0   ONLINE   0 0 0
>c1t13d0   ONLINE   0 0 0
>  mirror-1ONLINE   0 0 0
>c1t24d0   ONLINE   0 0 0
>c1t25d0   ONLINE   0 0 0
>  mirror-2ONLINE   0 0 0
>c1t30d0   ONLINE   0 0 0
>c1t31d0   ONLINE   0 0 0
>  mirror-3ONLINE   0 0 0
>c1t32d0   ONLINE   0 0 0
>c1t33d0   ONLINE   0 0 0
>logs
>  mirror-4ONLINE   0 0 0
>c2t2d0p6  ONLINE   0 0 0
>c2t3d0p6  ONLINE   0 0 0
>cache
>  c2t2d0p10   ONLINE   0 0 0
>  c2t3d0p10   ONLINE   0 0 0
> 
> errors: No known data errors
> root@stor:~# zpool status pool2
>  pool: pool2
> state: ONLINE
> scan: scrub canceled on Wed Dec 14 07:51:50 2011
> config:
> 
>NAME  STATE READ WRITE CKSUM
>pool2 ONLINE   0 0 0
>  mirror-0ONLINE   0 0 0
>c1t14d0   ONLINE   0 0 0
>c1t15d0   ONLINE   0 0 0
>  mirror-1ONLINE   0 0 0
>c1t18d0   ONLINE   0 0 0
>c1t19d0   ONLINE   0 0 0
>  mirror-2ONLINE   0 0 0
>c1t20d0   ONLINE   0 0 0
>c1t21d0   ONLINE   0 0 0
>  mirror-3ONLINE   0 0 0
>c1t22d0   ONLINE   0 0 0
>c1t23d0   ONLINE   0 0 0
>logs
>  mirror-4ONLINE   0 0 0
>c2t2d0p7  ONLINE   0 0 0
>c2t3d0p7  ONLINE   0 0 0
>cache
>  c2t2d0p11   ONLINE   0 0 0
>  c2t3d0p11   ONLINE   0 0 0
> 
> The affected pool does indeed have a mix of straight disks and
> mirrored disks (due to running out of vdevs on the controller),
> however it has to be added that the performance of the affected pool
> was excellent until around 3 weeks ago, and there have been no
> structural changes, neither to the pools nor to anything else on this
> server, in the last half year or so.
> 
> -jan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Darren J Moffat

On 12/18/11 11:52, Pawel Jakub Dawidek wrote:

On Thu, Dec 15, 2011 at 04:39:07PM -0700, Cindy Swearingen wrote:

Hi Anon,

The disk that you attach to the root pool will need an SMI label
and a slice 0.

The syntax to attach a disk to create a mirrored root pool
is like this, for example:

# zpool attach rpool c1t0d0s0 c1t1d0s0


BTW. Can you, Cindy, or someone else reveal why one cannot boot from
RAIDZ on Solaris? Is this because Solaris is using GRUB and RAIDZ code
would have to be licensed under GPL as the rest of the boot code?

I'm asking, because I see no technical problems with this functionality.
Booting off of RAIDZ (even RAIDZ3) and also from multi-top-level-vdev
pools works just fine on FreeBSD for a long time now. Not being forced
to have a dedicated pool just for the root if you happen to have more than
two disks in your box is very convenient.


For those of us not familiar with how FreeBSD is installed and boots, can
you explain how boot works (i.e., do you use GRUB at all and, if so, which
version, and where the early boot ZFS code is)?


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss