Re: [zfs-discuss] RAIDZ versus mirrored

2009-09-18 Thread Brandon High
On Thu, Sep 17, 2009 at 11:41 AM, Adam Leventhal a...@eng.sun.com wrote:
  RAID-3        bit-interleaved parity (basically not used)

There was a hardware RAID chipset that used RAID-3: the Netcell Revolution,
I think it was called.

It looked interesting and I thought about grabbing one at the time but
never got around to it. Netcell is defunct or got bought out, so the
controller is no longer available.

-B

-- 
Brandon High : bh...@freaks.com
Always try to do things in chronological order; it's less confusing that way.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4540 dead HDD replacement, remains configured.

2009-09-18 Thread John Ryan
I have exactly these symptoms on 3 Thumpers now:
2 x X4540 and 1 x X4500.
Rebooting/power cycling doesn't even bring them back. The only thing I found
is that if I boot from the osol.2009.06 CD, I can see all the drives.
I had to reinstall the OS on one box.

I've only just recently upgraded them to snv_122. Before that, I could change 
disks without problems.
Could it be something introduced since snv_111?

John
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Andrew Deason
On Thu, 17 Sep 2009 18:40:49 -0400
Robert Milkowski mi...@task.gda.pl wrote:

 if you would create a dedicated dataset for your cache and set quota
 on it then instead of tracking a disk space usage for each file you
 could easily check how much disk space is being used in the dataset.
 Would it suffice for you?

No. We need to be able to tell how close to full we are, for determining
when to start/stop removing things from the cache before we can add new
items to the cache again.

I'd also _like_ not to require a dedicated dataset for it, but it's not
like it's difficult for users to create one.

 Setting recordsize to 1k if you have lots of files (I assume) larger 
 than that doesn't really make sense.
 The problem with metadata is that by default it is also compressed so 
 there is no easy way to tell how much disk space it occupies for a 
 specified file using standard API.

We do not know in advance what file sizes we'll be seeing in general. We
could of course tell people to tune the cache dataset according to their
usage pattern, but I don't think users are generally going to know what
their cache usage pattern looks like.

I can say that at least right now, usually each file will be at most 1M
long (1M is the max unless the user specifically changes it). But
between the range 1k-1M, I don't know what the distribution looks like.

Can't I get even an /estimate/ of the data+metadata disk usage? And in the
hypothetical case where the metadata compression ratio is effectively the
same as without compression, what would the estimate be then?

-- 
Andrew Deason
adea...@sinenomine.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] deduplication

2009-09-18 Thread Blake
Thanks James!  I look forward to these - we could really use dedup in my org.

Blake

On Thu, Sep 17, 2009 at 6:02 PM, James C. McPherson
james.mcpher...@sun.com wrote:
 On Thu, 17 Sep 2009 11:50:17 -0500
 Tim Cook t...@cook.ms wrote:

 On Thu, Sep 17, 2009 at 5:27 AM, Thomas Burgess wonsl...@gmail.com wrote:

 
  I think you're right, and i also think we'll still see a new post asking
  about it once or twice a week.
 [snip]
 As we should.  Did the video of the talks about dedup ever even get posted
 to Sun's site?  I never saw it.  I remember being told we were all idiots
 when pointing out that it had mysteriously not been posted...

 Hi Tim,
 I certainly do not recall calling anybody an idiot for asking
 about the video or slideware.


 I definitely _do_ recall asking for people to be patient because

 (1) we had lighting problems with the auditorium which interfered
    with recording video

 (2) we have been getting the videos professionally edited so that
    when we can put them up on an appropriate site (which I imagine
    will be slx.sun.com), then the vids will adhere to the high
    standards which you have come to expect.

 (3) professional editing of videos takes time and money. We are
    getting this done as fast as we can.


 I asked Deirdre about the videos yesterday; she said that they
 are almost ready. Rest assured that when they are ready I will
 announce their availability as soon as I possibly can.


 James C. McPherson
 --
 Senior Kernel Software Engineer, Solaris
 Sun Microsystems
 http://blogs.sun.com/jmcp       http://www.jmcp.homeunix.com/blog
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool UNAVAIL even though disk is online: another label issue?

2009-09-18 Thread michael schuster

All,

this morning, I did pkg image-update from 118 to 123 (internal repo), and 
upon reboot all I got was the grub prompt - no menu, nothing.


I found a 2009.06 CD, and when I boot that and run zpool import, I
get told

localtank   UNAVAIL  insufficient replicas
  c8t1d0    ONLINE

some research showed that disklabel changes sometimes cause this, so I ran 
format:


AVAILABLE DISK SELECTIONS:
   0. c8t0d0 DEFAULT cyl 48639 alt 2 hd 255 sec 63
  /p...@0,0/pci108e,5...@7/d...@0,0
   1. c8t1d0 ATA-HITACHI HDS7240S-A33A-372.61GB
  /p...@0,0/pci108e,5...@7/d...@1,0
Specify disk (enter its number): 1
selecting c8t1d0
[disk formatted]
Note: capacity in disk label is smaller than the real disk capacity.
Select partition expand to adjust the label capacity.

[..]
partition print
Current partition table (original):
Total disk sectors available: 781401310 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm                256    372.60GB          781401310
  1 unassigned    wm                  0          0                   0
  2 unassigned    wm                  0          0                   0
  3 unassigned    wm                  0          0                   0
  4 unassigned    wm                  0          0                   0
  5 unassigned    wm                  0          0                   0
  6 unassigned    wm                  0          0                   0
  8   reserved    wm          781401311      8.00MB          781417694


Format already tells me that the label doesn't align with the disk size ...
should I just run 'expand', or should I change the first sector of
partition 0 to be 0?

 I'd appreciate advice on the above, and on how to avoid this in the future.
--
Michael Schuster http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Andrew Deason
On Fri, 18 Sep 2009 12:48:34 -0400
Richard Elling richard.ell...@gmail.com wrote:

 The transactional nature of ZFS may work against you here.
 Until the data is committed to disk, it is unclear how much space
 it will consume. Compression clouds the crystal ball further.

...but not impossible. I'm just looking for a reasonable upper bound.
For example, if I always rounded up to the next 128k mark, and added an
additional 128k, that would always give me an upper bound (for files <=
1M), as far as I can tell. But that is not a very tight bound; can you
suggest anything better?
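
(To make that concrete, here is a rough sketch of that rounding in shell,
where the 128k record size and the one extra record of slop are just my
working assumptions rather than anything ZFS guarantees:

  # crude upper bound for one cache file of $bytes logical size
  recsize=131072
  bound=$(( ( (bytes + recsize - 1) / recsize + 1 ) * recsize ))

i.e. round the logical size up to whole 128k records, then add one more
record to cover metadata.)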

  I'd also _like_ not to require a dedicated dataset for it, but
  it's not
  like it's difficult for users to create one.
 
 Use delegation.  Users can create their own datasets, set parameters,
 etc. For this case, you could consider changing recordsize, if you
 really are so worried about 1k. IMHO, it is easier and less expensive
 in process and pain to just buy more disk when needed.

Users of OpenAFS, not unprivileged users. All users I am talking about
are the administrators for their machines. I would just like to reduce
the number of filesystem-specific steps needed to set up the cache. You
don't need to do anything special for a tmpfs cache, for instance, or for
ext2/3 caches on Linux.

-- 
Andrew Deason
adea...@sinenomine.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] addendum: zpool UNAVAIL even though disk is online: another label issue?

2009-09-18 Thread michael schuster

michael schuster wrote:

All,

this morning, I did pkg image-update from 118 to 123 (internal repo), 
and upon reboot all I got was the grub prompt - no menu, nothing.


I found a 2009.06 CD, and when I boot that and run zpool import, I
get told

localtank   UNAVAIL  insufficient replicas
  c8t1d0    ONLINE

some research showed that disklabel changes sometimes cause this, so I 
ran format:


AVAILABLE DISK SELECTIONS:
   0. c8t0d0 DEFAULT cyl 48639 alt 2 hd 255 sec 63
  /p...@0,0/pci108e,5...@7/d...@0,0
   1. c8t1d0 ATA-HITACHI HDS7240S-A33A-372.61GB
  /p...@0,0/pci108e,5...@7/d...@1,0
Specify disk (enter its number): 1
selecting c8t1d0
[disk formatted]
Note: capacity in disk label is smaller than the real disk capacity.
Select partition expand to adjust the label capacity.

[..]
partition print
Current partition table (original):
Total disk sectors available: 781401310 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm                256    372.60GB          781401310
  1 unassigned    wm                  0          0                   0
  2 unassigned    wm                  0          0                   0
  3 unassigned    wm                  0          0                   0
  4 unassigned    wm                  0          0                   0
  5 unassigned    wm                  0          0                   0
  6 unassigned    wm                  0          0                   0
  8   reserved    wm          781401311      8.00MB          781417694


Format already tells me that the label doesn't align with the disk size 
...  should I just do expand, or should I change the first sectore of 
partition 0 to be 0?
 I'd appreciate advice on the above, and on how to avoid this in the 
future.


I just found out that this disk has been EFI-labelled, which I understand
isn't what ZFS likes/expects.


what to do now?

TIA
Michael
--
Michael Schuster http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] snv_XXX features / fixes - Solaris 10 version

2009-09-18 Thread Chris Banal
Since most ZFS features / fixes are reported in snv_XXX terms, is there some
sort of way to figure out which versions of Solaris 10 have the equivalent
features / fixes?

Thanks,
Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] addendum: zpool UNAVAIL even though disk is online: another label issue?

2009-09-18 Thread Cindy Swearingen

Michael,

ZFS handles EFI labels just fine, but you need an SMI label on the disk 
that you are booting from.


Are you saying that localtank is your root pool?

I believe the OSOL install creates a root pool called rpool. I don't 
remember if it's configurable.
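
Booted from the LiveCD, a plain 'zpool import' with no pool name should list
every importable pool; if rpool shows up there, something like

  # -R keeps the pool's mountpoints under an alternate root
  zpool import -f -R /mnt rpool

will bring it in (the pool name and mount point here are just the usual
defaults, so adjust as needed).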


Changing labels or partitions from beneath a live pool isn't supported 
and can cause data loss.


Can you describe the changes other than the pkg-image-update that led 
up to this problem?


Cindy

On 09/18/09 11:05, michael schuster wrote:

michael schuster wrote:

All,

this morning, I did pkg image-update from 118 to 123 (internal 
repo), and upon reboot all I got was the grub prompt - no menu, nothing.


I found a 2009.06 CD, and when I boot that and run zpool import, I
get told

localtank   UNAVAIL  insufficient replicas
  c8t1d0    ONLINE

some research showed that disklabel changes sometimes cause this, so 
I ran format:


AVAILABLE DISK SELECTIONS:
   0. c8t0d0 DEFAULT cyl 48639 alt 2 hd 255 sec 63
  /p...@0,0/pci108e,5...@7/d...@0,0
   1. c8t1d0 ATA-HITACHI HDS7240S-A33A-372.61GB
  /p...@0,0/pci108e,5...@7/d...@1,0
Specify disk (enter its number): 1
selecting c8t1d0
[disk formatted]
Note: capacity in disk label is smaller than the real disk capacity.
Select partition expand to adjust the label capacity.

[..]
partition print
Current partition table (original):
Total disk sectors available: 781401310 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm                256    372.60GB          781401310
  1 unassigned    wm                  0          0                   0
  2 unassigned    wm                  0          0                   0
  3 unassigned    wm                  0          0                   0
  4 unassigned    wm                  0          0                   0
  5 unassigned    wm                  0          0                   0
  6 unassigned    wm                  0          0                   0
  8   reserved    wm          781401311      8.00MB          781417694


Format already tells me that the label doesn't align with the disk 
size ...  should I just do expand, or should I change the first 
sectore of partition 0 to be 0?
 I'd appreciate advice on the above, and on how to avoid this in the 
future.


I just found out that this disk has been EFI-labelled, which I 
understand isn't what zfs like/expects.


what to do now?

TIA
Michael


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] If you have ZFS in production, willing to share some details (with me)?

2009-09-18 Thread Steffen Weiberle

I am trying to compile some deployment scenarios of ZFS.

If you are running ZFS in production, would you be willing to provide 
(publicly or privately)?


# of systems
amount of storage
application profile(s)
type of workload (low, high; random, sequential; read-only, read-write, 
write-only)

storage type(s)
industry
whether it is private or I can share in a summary
anything else that might be of interest

Thanks in advance!!

Steffen
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] addendum: zpool UNAVAIL even though disk is online: another label issue?

2009-09-18 Thread michael schuster

Cindy Swearingen wrote:

Michael,

ZFS handles EFI labels just fine, but you need an SMI label on the disk 
that you are booting from.


Are you saying that localtank is your root pool?


no... (I was on a plane yesterday and I'm still jet-lagged); I should have 
realised that that's strange.


I believe the OSOL install creates a root pool called rpool. I don't 
remember if its configurable.


I didn't do anything to change that. This leads me to the assumption that 
the disk I should be looking at is actually c8t0d0, the other disk in the 
format output.



Can you describe the changes other than the pkg-image-update that lead 
up to this problem?



0) pkg refresh; pkg install SUNWipkg
1) pkg image-update (creates opensolaris-119)
2) pkg mount opensolaris-119 /mnt
3) cat /mnt/etc/release (to verify I'd indeed installed b123)
4) pkg umount opensolaris-119
5) pkg rename opensolaris-119 opensolaris-123 # this failed, because it's 
active

6) pkg activate opensolaris-118   # so I can rename the new one
7) pkg rename ...
8) pkg activate opensolaris-123

9) reboot

thx
Michael


Cindy

On 09/18/09 11:05, michael schuster wrote:

michael schuster wrote:

All,

this morning, I did pkg image-update from 118 to 123 (internal 
repo), and upon reboot all I got was the grub prompt - no menu, nothing.


I found a 2009.06 CD, and when I boot that and run zpool import, I
get told

localtank   UNAVAIL  insufficient replicas
  c8t1d0    ONLINE

some research showed that disklabel changes sometimes cause this, so 
I ran format:


AVAILABLE DISK SELECTIONS:
   0. c8t0d0 DEFAULT cyl 48639 alt 2 hd 255 sec 63
  /p...@0,0/pci108e,5...@7/d...@0,0
   1. c8t1d0 ATA-HITACHI HDS7240S-A33A-372.61GB
  /p...@0,0/pci108e,5...@7/d...@1,0
Specify disk (enter its number): 1
selecting c8t1d0
[disk formatted]
Note: capacity in disk label is smaller than the real disk capacity.
Select partition expand to adjust the label capacity.

[..]
partition print
Current partition table (original):
Total disk sectors available: 781401310 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm                256    372.60GB          781401310
  1 unassigned    wm                  0          0                   0
  2 unassigned    wm                  0          0                   0
  3 unassigned    wm                  0          0                   0
  4 unassigned    wm                  0          0                   0
  5 unassigned    wm                  0          0                   0
  6 unassigned    wm                  0          0                   0
  8   reserved    wm          781401311      8.00MB          781417694


Format already tells me that the label doesn't align with the disk 
size ...  should I just do expand, or should I change the first 
sectore of partition 0 to be 0?
 I'd appreciate advice on the above, and on how to avoid this in the 
future.


I just found out that this disk has been EFI-labelled, which I 
understand isn't what zfs like/expects.


what to do now?

TIA
Michael





--
Michael Schuster http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Crazy Phantom Zpools Again

2009-09-18 Thread Dave Abrahams
I just did a fresh reinstall of OpenSolaris and I'm again seeing
the phenomenon described in 
http://article.gmane.org/gmane.os.solaris.opensolaris.zfs/26259
which I posted many months ago and got no reply to.

Can someone *please* help me figure out what's going on here?

Thanks in Advance,
--
Dave Abrahams
BoostPro Computing
http://boostpro.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] If you have ZFS in production, willing to share some details (with me)?

2009-09-18 Thread Jeremy Kister

On 9/18/2009 1:51 PM, Steffen Weiberle wrote:

I am trying to compile some deployment scenarios of ZFS.

# of systems


Do ZFS roots count, or only big pools?


amount of storage


Raw, or after parity?


--

Jeremy Kister
http://jeremy.kister.net./
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Richard Elling

On Sep 18, 2009, at 7:36 AM, Andrew Deason wrote:


On Thu, 17 Sep 2009 18:40:49 -0400
Robert Milkowski mi...@task.gda.pl wrote:


if you would create a dedicated dataset for your cache and set quota
on it then instead of tracking a disk space usage for each file you
could easily check how much disk space is being used in the dataset.
Would it suffice for you?


No. We need to be able to tell how close to full we are, for  
determining
when to start/stop removing things from the cache before we can add  
new

items to the cache again.


The transactional nature of ZFS may work against you here.
Until the data is committed to disk, it is unclear how much space
it will consume. Compression clouds the crystal ball further.



I'd also _like_ not to require a dedicated dataset for it, but it's  
not

like it's difficult for users to create one.


Use delegation.  Users can create their own datasets, set parameters,
etc. For this case, you could consider changing recordsize, if you  
really

are so worried about 1k. IMHO, it is easier and less expensive in
process and pain to just buy more disk when needed.
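
For example, something along these lines, where the dataset and user name
are only placeholders:

  # let user afsadm create child datasets and tune them under tank/cache
  zfs allow afsadm create,mount,quota,recordsize tank/cache
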
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] addendum: zpool UNAVAIL even though disk is online: another label issue?

2009-09-18 Thread michael schuster

Cindy Swearingen wrote:

Michael,

Get some rest. :-)

Then see if you can import your root pool while booted from the LiveCD.


that's what I tried; I'm never even shown rpool. I probably wouldn't 
have mentioned localtank at all if I had ;-)


After you get to that point, you might search the indiana-discuss 
archive for tips on

resolving the pkg-image-update no grub menu problem.


if I don't see rpool, that's not going to be the next step for me, right?

thx
Michael


Cindy

On 09/18/09 12:08, michael schuster wrote:

Cindy Swearingen wrote:

Michael,

ZFS handles EFI labels just fine, but you need an SMI label on the 
disk that you are booting from.


Are you saying that localtank is your root pool?


no... (I was on the plane yesterday, I'm still jet-lagged), I should 
have realised that that's strange.


I believe the OSOL install creates a root pool called rpool. I don't 
remember if its configurable.


I didn't do anything to change that. This leads me to the assumption 
that the disk I should be looking at is actually c8t0d0, the other 
disk in the format output.



Can you describe the changes other than the pkg-image-update that 
lead up to this problem?



0) pkg refresh; pkg install SUNWipkg
1) pkg image-update (creates opensolaris-119)
2) pkg mount opensolaris-119 /mnt
3) cat /mnt/etc/release (to verify I'd indeed installed b123)
4) pkg umount opensolaris-119
5) pkg rename opensolaris-119 opensolaris-123 # this failed, because 
it's active

6) pkg activate opensolaris-118   # so I can rename the new one
7) pkg rename ...
8) pkg activate opensolaris-123

9) reboot

thx
Michael


Cindy

On 09/18/09 11:05, michael schuster wrote:

michael schuster wrote:

All,

this morning, I did pkg image-update from 118 to 123 (internal 
repo), and upon reboot all I got was the grub prompt - no menu, 
nothing.


I found a 2009.06 CD, and when I boot that and run zpool import, I
get told

localtank   UNAVAIL  insufficient replicas
  c8t1d0    ONLINE

some research showed that disklabel changes sometimes cause this, 
so I ran format:


AVAILABLE DISK SELECTIONS:
   0. c8t0d0 DEFAULT cyl 48639 alt 2 hd 255 sec 63
  /p...@0,0/pci108e,5...@7/d...@0,0
   1. c8t1d0 ATA-HITACHI HDS7240S-A33A-372.61GB
  /p...@0,0/pci108e,5...@7/d...@1,0
Specify disk (enter its number): 1
selecting c8t1d0
[disk formatted]
Note: capacity in disk label is smaller than the real disk capacity.
Select partition expand to adjust the label capacity.

[..]
partition print
Current partition table (original):
Total disk sectors available: 781401310 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm                256    372.60GB          781401310
  1 unassigned    wm                  0          0                   0
  2 unassigned    wm                  0          0                   0
  3 unassigned    wm                  0          0                   0
  4 unassigned    wm                  0          0                   0
  5 unassigned    wm                  0          0                   0
  6 unassigned    wm                  0          0                   0
  8   reserved    wm          781401311      8.00MB          781417694



Format already tells me that the label doesn't align with the disk 
size ...  should I just do expand, or should I change the first 
sectore of partition 0 to be 0?
 I'd appreciate advice on the above, and on how to avoid this in 
the future.


I just found out that this disk has been EFI-labelled, which I 
understand isn't what zfs like/expects.


what to do now?

TIA
Michael










--
Michael Schuster http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS HW RAID

2009-09-18 Thread Lloyd H. Gill

Hello folks,

I am sure this topic has been asked, but I am new to this list. I have read
a ton of docs on the web, but wanted to get some opinions from you all.
Also, if someone has a digest of the last time this was discussed, you can
just send that to me. In any case, I am reading a lot of mixed reviews
related to ZFS on HW RAID devices.

The Sun docs seem to indicate it is possible, but not a recommended course. I
realize there are some advantages, such as snapshots, etc. But the h/w RAID
will handle 'most' disk problems, basically undercutting the capabilities
that are among the big reasons to deploy ZFS. One suggestion would be to
create the h/w RAID LUNs as usual, present them to the OS, then do simple
striping with ZFS. Here are my two applications, where I am presented with
this possibility:

Sun Messaging Environment:
We currently use EMC storage. The storage team manages all Enterprise
storage. We currently have 10x300gb UFS mailstores presented to the OS. Each
LUN is a HW RAID 5 device. We will be upgrading the application and doing a
hardware refresh of this environment, which will give us the chance to move
to ZFS, but stay on EMC storage. I am sure the storage team will not want to
present us with JBOD. It is their practice to create the HW LUNs and present
them to the application teams. I don't want to end up with a complicated
scenario, but would like to leverage ZFS as much as I can while staying on
the EMC array, as I mentioned.

Sun Directory Environment:
The directory team is running HP DL385 G2, which also has a built-in HW RAID
controller for 5 internal SAS disks. The team currently has DS5.2 deployed
on RHEL3, but as we move to DS6.3.1, they may want to move to Solaris 10. We
have an opportunity to move to ZFS in this environment, but I am curious how
to best leverage ZFS capabilities in this scenario. JBOD is very clear, but
a lot of manufacturers out there are still offering HW RAID technologies,
with high-speed caches. Using ZFS with these is not very clear to me, and as
I mentioned, there are very mixed reviews, not on ZFS features but on how
it's used in HW RAID settings.

Thanks for any observations.

Lloyd
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Robert Milkowski

Andrew Deason wrote:

On Thu, 17 Sep 2009 18:40:49 -0400
Robert Milkowski mi...@task.gda.pl wrote:

  

if you would create a dedicated dataset for your cache and set quota
on it then instead of tracking a disk space usage for each file you
could easily check how much disk space is being used in the dataset.
Would it suffice for you?



No. We need to be able to tell how close to full we are, for determining
when to start/stop removing things from the cache before we can add new
items to the cache again.
  


but having a dedicated dataset will let you answer such a question 
immediately, as ZFS will then tell you, for that dataset, how much space is 
used (everything: data + metadata) and how much is left.
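
For example (the pool and dataset names are only placeholders):

  zfs create -o quota=5g tank/afscache
  zfs list -o name,used,avail tank/afscache
  df -h /tank/afscache      # the statvfs-level view of the same numbers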



I'd also _like_ not to require a dedicated dataset for it, but it's not
like it's difficult for users to create one.

  

no, it is not.

Setting recordsize to 1k if you have lots of files (I assume) larger 
than that doesn't really make sense.
The problem with metadata is that by default it is also compressed so 
there is no easy way to tell how much disk space it occupies for a 
specified file using standard API.



We do not know in advance what file sizes we'll be seeing in general. We
could of course tell people to tune the cache dataset according to their
usage pattern, but I don't think users are generally going to know what
their cache usage pattern looks like.

I can say that at least right now, usually each file will be at most 1M
long (1M is the max unless the user specifically changes it). But
between the range 1k-1M, I don't know what the distribution looks like.

  
What I meant was that I believe that default recordsize of 128k should 
be fine for you (files smaller than 128k will use smaller recordsize, 
larger ones will use a recordsize of 128k). The only problem will be 
with files truncated to 0 and growing again, as they will be stuck with 
the old recordsize. But in most cases it probably won't be a practical 
problem anyway.
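
(If you do want to check or pin the value per dataset, that is just, e.g.:

  zfs get recordsize tank/cache
  zfs set recordsize=128k tank/cache

with tank/cache standing in for whatever the cache dataset ends up being.)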




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS HW RAID

2009-09-18 Thread Bob Friesenhahn

On Fri, 18 Sep 2009, Lloyd H. Gill wrote:


The Sun docs seem to indicate it possible, but not a recommended course. I
realize there are some advantages, such as snapshots, etc. But, the h/w raid
will handle most disk problems, basically reducing the great capabilities
of the big reasons to deploy zfs. One suggestion would be to create the h/w
RAID LUNs as usual, present them to the OS, then do simple striping with
ZFS.


ZFS will catch issues that the H/W RAID will not.  Other than this, 
there is nothing inherently wrong with the simple striping with ZFS 
as long as you are confident about your SAN device.  If your SAN 
device fails, the whole ZFS pool may be lost, and if the failure is 
temporary, then the pool will be down until the SAN is restored.


If you care to keep your pool up and alive as much as possible, then 
mirroring across SAN devices is recommended.
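
Something like the following, where lunA1/lunB1 stand in for LUNs taken 
from two different SAN devices:

  zpool create tank mirror lunA1 lunB1 mirror lunA2 lunB2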


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS HW RAID

2009-09-18 Thread Robert Milkowski

Hi,

see comments inline:

Lloyd H. Gill wrote:


Hello folks,

I am sure this topic has been asked, but I am new to this list. I have 
read a ton of doc’s on the web, but wanted to get some opinions from 
you all. Also, if someone has a digest of the last time this was 
discussed, you can just send that to me. In any case, I am reading a 
lot of mixed reviews related to ZFS on HW RAID devices.


The Sun docs seem to indicate it possible, but not a recommended 
course. I realize there are some advantages, such as snapshots, etc. 
But, the h/w raid will handle ‘most’ disk problems, basically reducing 
the great capabilities of the big reasons to deploy zfs. One 
suggestion would be to create the h/w RAID LUNs as usual, present them 
to the OS, then do simple striping with ZFS. Here are my two 
applications, where I am presented with this possibility:


Of course you can use ZFS on disk arrays with RAID done in HW, and you will 
still be able to use most ZFS features, including snapshots, clones, 
compression, etc.


It is not recommended in the sense that unless the pool has a redundant 
configuration from ZFS's point of view, ZFS won't be able to heal corrupted 
blocks if they occur (though it will still be able to detect them). Most 
other filesystems on the market won't even detect such a case, let alone 
repair it, so if you are OK with not having this great ZFS feature then go 
ahead. All the other features of ZFS will work as expected.


Now, if you want to present several LUNs with RAID done in HW, then yes, the 
best approach usually is to add them all to a pool in a striped 
configuration. ZFS will always put 2 or 3 copies of metadata on different 
LUNs if possible, so you will end up with some protection (self-healing) 
from ZFS, for metadata at least.
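
A minimal sketch, with lun1..lun4 standing in for the HW RAID LUNs as 
presented to the OS:

  zpool create mailpool lun1 lun2 lun3 lun4
  zpool status mailpool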


The other option (more expensive) is to do RAID-10 or RAID-Z on top of LUNs 
which are already protected with some RAID level on the disk array. For 
example, if you presented 4 LUNs, each with RAID-5 done in HW, and then 
created a pool with 'zpool create test mirror lun1 lun2 mirror lun3 lun4', 
you would effectively end up with a mirrored RAID-5 (RAID 5+1) configuration. 
It would of course halve the available logical storage, but it would allow 
ZFS to do self-healing.



Sun Messaging Environment:
We currently use EMC storage. The storage team manages all Enterprise 
storage. We currently have 10x300gb UFS mailstores presented to the 
OS. Each LUN is a HW RAID 5 device. We will be upgrading the 
application and doing a hardware refresh of this environment, which 
will give us the chance to move to ZFS, but stay on EMC storage. I am 
sure the storage team will not want to present us with JBOD. It is 
there practice to create the HW LUNs and present them to the 
application teams. I don’t want to end up with a complicated scenario, 
but would like to leverage the most I can with ZFS, but on the EMC 
array as I mentioned.



just create a pool which would stripe across such luns.





Sun Directory Environment:
The directory team is running HP DL385 G2, which also has a built-in 
HW RAID controller for 5 internal SAS disks. The team currently has 
DS5.2 deployed on RHEL3, but as we move to DS6.3.1, they may want to 
move to Solaris 10. We have an opportunity to move to ZFS in this 
environment, but am curious how to best leverage ZFS capabilities in 
this scenario. JBOD is very clear, but a lot of manufacturers out 
there are still offering HW RAID technologies, with high-speed caches. 
Using ZFS with these is not very clear to me, and as I mentioned, 
there are very mixed reviews, not on ZFS features, but how it’s used 
in HW RAID settings.


Here you have three options. The first is RAID in HW presenting one LUN, 
with a pool created on top of it. ZFS will be able to detect corruption if 
it happens but won't be able to fix it (at least not for data).


Another option is to present each disk as a RAID-0 LUN and then do RAID-10 
or RAID-Z in ZFS. Most RAID controllers will still use their cache in such a 
configuration, so you would still benefit from it, and ZFS will be able to 
detect and fix corruption if it happens. However, the procedure for 
replacing a failed disk drive could be more complicated, or even require 
downtime, depending on the controller and on whether there is a management 
tool for it on Solaris (otherwise, with many PCI controllers, if a disk in a 
one-disk RAID-0 dies you will have to go into the controller's BIOS and 
re-create the failed disk with a new one). Check your controller; maybe this 
is not an issue for you, or maybe it is an acceptable approach anyway.


The last option would be to disable the RAID controller, access the disks 
directly, and do RAID in ZFS. That way you lose the cache, of course.
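
For the last two options the ZFS side would look roughly like this, with the 
controller/target numbers as placeholders for the 5 internal disks:

  zpool create ldappool raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0
  # or, trading capacity for mirrors plus a hot spare:
  zpool create ldappool mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 spare c1t4d0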


If your applications are sensitive to write latency on your LDAP database, 
then going with one of the first two options could actually prove to be the 
faster solution (assuming the volume of writes is not so big that the cache 
is 100% utilized all the time, as then it comes down to the disks anyway).



Another thing 

Re: [zfs-discuss] snv_XXX features / fixes - Solaris 10 version

2009-09-18 Thread Robert Milkowski

Richard Elling wrote:


On Sep 18, 2009, at 10:06 AM, Chris Banal wrote:

Since most zfs features / fixes are reported in snv_XXX terms. Is 
there some sort of way to figure out which versions of Solaris 10 
have the equivalent features / fixes?


There is no automated nor easy way to do this. Not all features are
backported to Solaris 10. The best you can hope for is that the CRs
are mentioned in Solaris 10 patches.  Since the contents of many
Solaris 10 CRs are not publicly available, this becomes a form of a
guessing game.

My suggestion: get a subscription for OpenSolaris.
 -- richard

in many cases you can look through the changelog of a given ON build at 
opensolaris.org and then search SunSolve for the bug ID in S10 kernel 
patches to see if it is mentioned in any of them.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ versus mirrored

2009-09-18 Thread Bill Sommerfeld

On Wed, 2009-09-16 at 14:19 -0700, Richard Elling wrote:
 Actually, I had a ton of data on resilvering which shows mirrors and
 raidz equivalently bottlenecked on the media write bandwidth. However,
 there are other cases which are IOPS bound (or CR bound :-) which
 cover some of the postings here. I think Sommerfeld has some other
 data which could be pertinent.

I'm not sure I have data, but I have anecdotes and observations, and a
few large production pools used for solaris development by me and my
coworkers.

the biggest one (by disk count) takes 80-100 hours to scrub and/or
resilver.

my working hypothesis is that pools which:
 1) have a lot of files, directories, filesystems, and periodic
snapshots
 2) have atime updates enabled (default config)
 3) have regular (daily) jobs doing large-scale filesystem tree-walks

wind up rewriting most blocks of the dnode files on every tree walk
doing atime updates, and as a result the dnode file (but not most of the
blocks it points to) differs greatly from daily snapshot to daily
snapshot.

as a result, scrub/resilver traversals end up spending most of their 
time doing random reads of the dnode files of each snapshot.
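
(atime itself is just a per-dataset property, so one cheap experiment, if 
the workload tolerates it, is something like

  zfs set atime=off tank/ws

where tank/ws is only a placeholder; whether that actually reduces the 
per-snapshot dnode churn described above is speculation on my part, not 
something I have measured.)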

here are some bugs that, if fixed, might help:

6678033 resilver code should prefetch
6730737 investigate colocating directory dnodes

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Andrew Deason
On Fri, 18 Sep 2009 16:38:28 -0400
Robert Milkowski mi...@task.gda.pl wrote:

  No. We need to be able to tell how close to full we are, for
  determining when to start/stop removing things from the cache
  before we can add new items to the cache again.

 
 but having a dedicated dataset will let you answer such a question 
 immediatelly as then you get from zfs information from for the
 dataset on how much space is used (everything: data + metadata) and
 how much is left.

Immediately? There isn't a delay between the write and the next commit
when the space is recorded? (Do you mean a statvfs equivalent, or some
zfs-specific call?)

And the current code is structured such that we record usage changes
before a write; it would be a huge pain to rely on the write to
calculate the usage (for that and other reasons).

  Setting recordsize to 1k if you have lots of files (I assume)
  larger than that doesn't really make sense.
  The problem with metadata is that by default it is also compressed
  so there is no easy way to tell how much disk space it occupies
  for a specified file using standard API.
  
 
  We do not know in advance what file sizes we'll be seeing in
  general. We could of course tell people to tune the cache dataset
  according to their usage pattern, but I don't think users are
  generally going to know what their cache usage pattern looks like.
 
  I can say that at least right now, usually each file will be at
  most 1M long (1M is the max unless the user specifically changes
  it). But between the range 1k-1M, I don't know what the
  distribution looks like.
 

 What I meant was that I believe that default recordsize of 128k
 should be fine for you (files smaller than 128k will use smaller
 recordsize, larger ones will use a recordsize of 128k). The only
 problem will be with files truncated to 0 and growing again as they
 will be stuck with an old recordsize. But in most cases it won't
 probably be a practical problem anyway.

Well, it may or may not be 'fine'; we may have a lot of little files in
the cache, and rounding up to 128k for each one reduces our disk
efficiency somewhat. Files are truncated to 0 and grow again quite often
in busy clients. But that's an efficiency issue, we'd still be able to
stay within the configured limit that way.

But anyway, 128k may be fine for me, but what about if someone sets
their recordsize to something different? That's why I was wondering
about the overhead if someone sets the recordsize to 1k; is there no way
to account for it even if I know the recordsize is 1k?

-- 
Andrew Deason
adea...@sinenomine.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS HW RAID

2009-09-18 Thread Scott Lawson



Lloyd H. Gill wrote:


Hello folks,

I am sure this topic has been asked, but I am new to this list. I have 
read a ton of doc's on the web, but wanted to get some opinions from 
you all. Also, if someone has a digest of the last time this was 
discussed, you can just send that to me. In any case, I am reading a 
lot of mixed reviews related to ZFS on HW RAID devices.


The Sun docs seem to indicate it possible, but not a recommended 
course. I realize there are some advantages, such as snapshots, etc. 
But, the h/w raid will handle 'most' disk problems, basically reducing 
the great capabilities of the big reasons to deploy zfs. One 
suggestion would be to create the h/w RAID LUNs as usual, present them 
to the OS, then do simple striping with ZFS. Here are my two 
applications, where I am presented with this possibility:
Comments below from me, as I am a user of both of these environments, both 
with ZFS. You may also want to check the iMS archives or subscribe to 
the list. This is

where all the Sun Messaging Server gurus hang out.  (I listen mostly ;))

List is : info-...@arnold.com and you can get more info here : 
http://mail.arnold.com/info-ims.htmlx


Sun Messaging Environment:
We currently use EMC storage. The storage team manages all Enterprise 
storage. We currently have 10x300gb UFS mailstores presented to the 
OS. Each LUN is a HW RAID 5 device. We will be upgrading the 
application and doing a hardware refresh of this environment, which 
will give us the chance to move to ZFS, but stay on EMC storage. I am 
sure the storage team will not want to present us with JBOD. It is 
there practice to create the HW LUNs and present them to the 
application teams. I don't want to end up with a complicated scenario, 
but would like to leverage the most I can with ZFS, but on the EMC 
array as I mentioned.
In this environment I do what Bob mentioned in his reply to you: I provision 
two LUNs for each data volume and mirror them with ZFS. The LUNs are based 
on RAID-5 stripes on 3510s, 3511s and 6140s. Mirroring them with ZFS gives 
all of the niceties of ZFS, and it will catch any of the silent data 
corruption type issues that hardware RAID will not. My reasons for doing it 
this way go back to DiskSuite days as well (which I no longer use; it's ZFS 
or nothing pretty much these days).


My setup is based on 5 x 250 GB mirrored pairs with around 3-4 million 
messages per volume.


The two LUNs I mirror are *always* provisioned from two separate arrays 
in different data centers. This also means that in the case of a massive 
catastrophe at one
data centre, I should have a good copy from the 'mirror of last resort' 
that I can get our business back up and running on quickly.
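
The ZFS side of that is just a plain mirror of the two LUNs, something like 
the following, with the bracketed names standing in for a LUN from each array:

  zpool create store01 mirror <lun-from-array-A> <lun-from-array-B>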


Another advantage of this is that it also allows for relatively easy 
array maintenance and upgrades. ZFS only resilvers changed blocks rather 
than doing a complete block resync like DiskSuite does. This allows for very 
fast convergence times on the likes of file servers where change is 
relatively light, albeit continuous. Mirrors here are super quick to 
reconverge in my experience, a little quicker than RAIDZs. (I don't have 
data to back this up, just a casual observation.)


In some respects I am both a storage guy and a systems guy, and sometimes 
the storage people need to get with the program a bit. :P If you use ZFS 
with one of its redundant forms (mirrors or RAIDZs) then JBOD presentation 
will be fine.


Sun Directory Environment:
The directory team is running HP DL385 G2, which also has a built-in 
HW RAID controller for 5 internal SAS disks. The team currently has 
DS5.2 deployed on RHEL3, but as we move to DS6.3.1, they may want to 
move to Solaris 10. We have an opportunity to move to ZFS in this 
environment, but am curious how to best leverage ZFS capabilities in 
this scenario. JBOD is very clear, but a lot of manufacturers out 
there are still offering HW RAID technologies, with high-speed caches. 
Using ZFS with these is not very clear to me, and as I mentioned, 
there are very mixed reviews, not on ZFS features, but how it's used 
in HW RAID settings.
The Sun Directory environment generally isn't very I/O intensive, except 
during massive data reloads or indexing operations. Other than that it is an 
ideal candidate for ZFS and its rather nice ARC cache. Memory is cheap on a 
lot of boxes, and it will make read-mostly file systems fly. I imagine your 
actual live LDAP data set on disk probably won't be larger than 10 gigs or 
so? I have around 400K objects in mine and it's only about 2 gigs or so 
including all our indexes. I tend to tune DS up so that everything it needs 
is in RAM anyway. As far as Directory Server goes, are you using the 64-bit 
version on Linux? If not, you should be.


Thanks for any observations.

Lloyd


___
zfs-discuss mailing list

Re: [zfs-discuss] ZFS file disk usage

2009-09-18 Thread Robert Milkowski

Andrew Deason wrote:

On Fri, 18 Sep 2009 16:38:28 -0400
Robert Milkowski mi...@task.gda.pl wrote:

  

No. We need to be able to tell how close to full we are, for
determining when to start/stop removing things from the cache
before we can add new items to the cache again.
  
  
but having a dedicated dataset will let you answer such a question 
immediatelly as then you get from zfs information from for the

dataset on how much space is used (everything: data + metadata) and
how much is left.



Immediately? There isn't a delay between the write and the next commit
when the space is recorded? (Do you mean a statvfs equivalent, or some
zfs-specific call?)

And the current code is structured such that we record usage changes
before a write; it would be a huge pain to rely on the write to
calculate the usage (for that and other reasons).
  


There will be a delay of up to 30s currently.

But how much data do you expect to be pushed within 30s?
Let's say it was even 10 GB in lots of small files, and you calculated the 
total size by only summing up the logical size of the data. Would you 
really expect the error to be greater than 5%, which would be 500 MB? 
Does it matter in practice?





Setting recordsize to 1k if you have lots of files (I assume)
larger than that doesn't really make sense.
The problem with metadata is that by default it is also compressed
so there is no easy way to tell how much disk space it occupies
for a specified file using standard API.



We do not know in advance what file sizes we'll be seeing in
general. We could of course tell people to tune the cache dataset
according to their usage pattern, but I don't think users are
generally going to know what their cache usage pattern looks like.

I can say that at least right now, usually each file will be at
most 1M long (1M is the max unless the user specifically changes
it). But between the range 1k-1M, I don't know what the
distribution looks like.

  
  

What I meant was that I believe that default recordsize of 128k
should be fine for you (files smaller than 128k will use smaller
recordsize, larger ones will use a recordsize of 128k). The only
problem will be with files truncated to 0 and growing again as they
will be stuck with an old recordsize. But in most cases it won't
probably be a practical problem anyway.



Well, it may or may not be 'fine'; we may have a lot of little files in
the cache, and rounding up to 128k for each one reduces our disk
efficiency somewhat. Files are truncated to 0 and grow again quite often
in busy clients. But that's an efficiency issue, we'd still be able to
stay within the configured limit that way.

But anyway, 128k may be fine for me, but what about if someone sets
their recordsize to something different? That's why I was wondering
about the overhead if someone sets the recordsize to 1k; is there no way
to account for it even if I know the recordsize is 1k?

  


What if a user enables compression like lzjb or even gzip?
How would you take that into account before doing the writes?

What if a user creates a snapshot? How would you take that into account?

I suspect you are looking at this too closely, for no real benefit.
Especially if you don't want to dedicate a dataset to the cache, you have to 
expect other applications in the system to write to the same file system in 
different locations, and you have no control over, or ability to predict, 
how much data will be written at all. Be it Linux, Solaris, BSD, ..., the 
issue will be there.


IMHO a dedicated dataset and statvfs() on it should be good enough, possibly 
with an estimate before writing your data (the total logical file size from 
the application's point of view); however, with compression or dedup 
enabled by the user, that estimate could be totally wrong, so it probably 
doesn't actually make sense.



--
Robert Milkowski
http://milek.blogspot.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS HW RAID

2009-09-18 Thread Robert Milkowski

Scott Lawson wrote:
Sun Directory environment generally isn't very IO intensive, except 
for in massive data reloads or indexing operations. Other than this it 
is an ideal candidate for ZFS
and it's rather nice ARC cache. Memory is cheap on a lot of boxes and 
it will make read only type file systems fly. I imagine your actual 
living LDAP data set on disk
probably won't be larger than 10 Gigs or so? I have around 400K 
objects in mine and it's only about 2 Gigs or so including all our 
indexes. I tend to tune DS up
so that everything it needs is in RAM anyway. As far as diectory 
server goes, are you using the 64 bit version on Linux? If not you 
should be as well.




From my experience, enabling lzjb compression for DS makes it even faster 
and reduces disk usage by about 2x.
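
That is just a dataset property, e.g. (the dataset name is a placeholder):

  zfs set compression=lzjb tank/ds
  zfs get compressratio tank/ds    # see what it actually buys you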


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Crazy Phantom Zpools Again

2009-09-18 Thread Cindy Swearingen

Dave,

I've searched opensolaris.org and our internal bug database.
I don't see that anyone else has reported this problem.

I asked someone from the OSOL install team and this behavior
is a mystery.

If you destroyed the phantom pools before you reinstalled,
then they probably returned from the import operations but
I can't be sure.

If you want to export your tank pool and re-import it, then
maybe you should just use 'zpool import tank' until the root
cause of the phantom pools is determined.


Not much help, but some ideas:

1. What does the zpool history -l output say for the phantom pools?
Were they created at the same time as the root pool or the same time
as tank?

2. The phantom pools contain the c8t1* and c9t1* fdisk partitions (p0s) 
that are in your tank pool as whole disks. A strange coincidence.


Does zdb output or fmdump output identify the relationship, if
any, between the c8 and c9 devices in the phantom pools and tank?

3. I can file a bug for you. Please provide the system information,
such as hardware, disks, OS release.
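
For reference, the sort of output I mean above (pool and device names are 
examples):

  zpool history -l <phantom-pool>
  zdb -l /dev/dsk/c8t1d0p0     # dump the ZFS labels on the fdisk partition
  fmdump -eV                   # any FMA ereports involving those devices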


Cindy



On 09/18/09 12:18, Dave Abrahams wrote:

I just did a fresh reinstall of OpenSolaris and I'm again seeing
the phenomenon described in 
http://article.gmane.org/gmane.os.solaris.opensolaris.zfs/26259

which I posted many months ago and got no reply to.

Can someone *please* help me figure out what's going on here?

Thanks in Advance,
--
Dave Abrahams
BoostPro Computing
http://boostpro.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS HW RAID

2009-09-18 Thread David Magda

On Sep 18, 2009, at 16:52, Bob Friesenhahn wrote:

If you care to keep your pool up and alive as much as possible, then  
mirroring across SAN devices is recommended.


One suggestion I heard was to get a LUN that's twice the size, and set  
copies=2. This way you have some redundancy for incorrect checksums.


Haven't done it myself.
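
(The knob itself is just a dataset property, e.g.

  zfs set copies=2 tank/data

and it only applies to newly written blocks; existing data is not rewritten.)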

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS HW RAID

2009-09-18 Thread Bob Friesenhahn

On Fri, 18 Sep 2009, David Magda wrote:


If you care to keep your pool up and alive as much as possible, then 
mirroring across SAN devices is recommended.


One suggestion I heard was to get a LUN that's twice the size, and set 
copies=2. This way you have some redundancy for incorrect checksums.


This only helps for block-level corruption.  It does not help much at 
all if a whole LUN goes away.  It seems best for single disk rpools.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss