Re: [zfs-discuss] RAIDZ versus mirrored
On Thu, Sep 17, 2009 at 11:41 AM, Adam Leventhal a...@eng.sun.com wrote:
> RAID-3 bit-interleaved parity (basically not used)

There was a hardware RAID chipset that used RAID-3; Netcell Revolution, I think it was called. It looked interesting and I thought about grabbing one at the time but never got around to it. Netcell is defunct or got bought out, so the controller is no longer available.

-B

--
Brandon High : bh...@freaks.com
Always try to do things in chronological order; it's less confusing that way.
Re: [zfs-discuss] x4540 dead HDD replacement, remains configured.
I have exactly these symptoms on 3 thumpers now: 2 x X4540 and 1 x X4500. Rebooting/power cycling doesn't even bring them back. The only thing I found is that if I boot from the osol.2009.06 CD, I can see all the drives. I had to reinstall the OS on one box.

I've only just recently upgraded them to snv_122. Before that, I could change disks without problems. Could it be something introduced since snv_111?

John
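For reference, the usual hot-swap sequence on these boxes (the one that worked before the upgrade) looks something like the following; the pool, attachment point, and device names are illustrative only:

  cfgadm -al | grep sata            # locate the attachment point of the failed drive
  zpool offline tank c1t5d0         # take the device offline in the pool
  cfgadm -c unconfigure sata1/5     # unconfigure the slot, then swap the physical disk
  cfgadm -c configure sata1/5       # bring the new disk back under the OS
  zpool replace tank c1t5d0         # resilver onto the replacement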
Re: [zfs-discuss] ZFS file disk usage
On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski mi...@task.gda.pl wrote:
> if you would create a dedicated dataset for your cache and set quota on it then instead of tracking a disk space usage for each file you could easily check how much disk space is being used in the dataset. Would it suffice for you?

No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again. I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.

> Setting recordsize to 1k if you have lots of files (I assume) larger than that doesn't really make sense. The problem with metadata is that by default it is also compressed so there is no easy way to tell how much disk space it occupies for a specified file using standard API.

We do not know in advance what file sizes we'll be seeing in general. We could of course tell people to tune the cache dataset according to their usage pattern, but I don't think users are generally going to know what their cache usage pattern looks like. I can say that at least right now, usually each file will be at most 1M long (1M is the max unless the user specifically changes it). But between the range 1k-1M, I don't know what the distribution looks like.

I can't get an /estimate/ on the data+metadata disk usage? What about in the hypothetical case of the metadata compression ratio being effectively the same as without compression, what would it be then?

--
Andrew Deason
adea...@sinenomine.net
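A minimal sketch of the dedicated-dataset approach Robert describes, with made-up pool and dataset names:

  # dedicated cache dataset, capped at 2 GB
  zfs create -o quota=2g -o mountpoint=/var/cache/afs tank/afscache
  # 'used' on the dataset already accounts for data plus metadata
  zfs list -o name,used,avail,quota tank/afscache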
Re: [zfs-discuss] deduplication
Thanks James! I look forward to these - we could really use dedup in my org.

Blake

On Thu, Sep 17, 2009 at 6:02 PM, James C. McPherson james.mcpher...@sun.com wrote:
> On Thu, 17 Sep 2009 11:50:17 -0500 Tim Cook t...@cook.ms wrote:
>> On Thu, Sep 17, 2009 at 5:27 AM, Thomas Burgess wonsl...@gmail.com wrote:
>>> I think you're right, and i also think we'll still see a new post asking about it once or twice a week.
>> [snip]
>> As we should. Did the video of the talks about dedup ever even get posted to Sun's site? I never saw it. I remember being told we were all idiots when pointing out that it had mysteriously not been posted...
>
> Hi Tim,
> I certainly do not recall calling anybody an idiot for asking about the video or slideware. I definitely _do_ recall asking for people to be patient because
> (1) we had lighting problems with the auditorium which interfered with recording video
> (2) we have been getting the videos professionally edited so that when we can put them up on an appropriate site (which I imagine will be slx.sun.com), the vids will adhere to the high standards which you have come to expect.
> (3) professional editing of videos takes time and money. We are getting this done as fast as we can.
>
> I asked Deirdre about the videos yesterday; she said that they are almost ready. Rest assured that when they are ready I will announce their availability as soon as I possibly can.
>
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
[zfs-discuss] zpool UNAVAIL even though disk is online: another label issue?
All,

this morning, I did pkg image-update from 118 to 123 (internal repo), and upon reboot all I got was the grub prompt - no menu, nothing. I found a 2009.06 CD, and when I boot that and run zpool import, I get told

  localtank   UNAVAIL  insufficient replicas
    c8t1d0    ONLINE

some research showed that disklabel changes sometimes cause this, so I ran format:

  AVAILABLE DISK SELECTIONS:
    0. c8t0d0 <DEFAULT cyl 48639 alt 2 hd 255 sec 63>
       /p...@0,0/pci108e,5...@7/d...@0,0
    1. c8t1d0 <ATA-HITACHI HDS7240S-A33A-372.61GB>
       /p...@0,0/pci108e,5...@7/d...@1,0
  Specify disk (enter its number): 1
  selecting c8t1d0
  [disk formatted]
  Note: capacity in disk label is smaller than the real disk capacity.
  Select partition> expand to adjust the label capacity.
  [..]
  partition> print
  Current partition table (original):
  Total disk sectors available: 781401310 + 16384 (reserved sectors)

  Part        Tag    Flag    First Sector        Size    Last Sector
    0         usr     wm             256    372.60GB      781401310
    1  unassigned     wm               0           0              0
    2  unassigned     wm               0           0              0
    3  unassigned     wm               0           0              0
    4  unassigned     wm               0           0              0
    5  unassigned     wm               0           0              0
    6  unassigned     wm               0           0              0
    8    reserved     wm       781401311      8.00MB      781417694

Format already tells me that the label doesn't align with the disk size ... should I just do expand, or should I change the first sector of partition 0 to be 0? I'd appreciate advice on the above, and on how to avoid this in the future.

--
Michael Schuster http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
Re: [zfs-discuss] ZFS file disk usage
On Fri, 18 Sep 2009 12:48:34 -0400 Richard Elling richard.ell...@gmail.com wrote:
> The transactional nature of ZFS may work against you here. Until the data is committed to disk, it is unclear how much space it will consume. Compression clouds the crystal ball further.

...but not impossible. I'm just looking for a reasonable upper bound. For example, if I always rounded up to the next 128k mark, and added an additional 128k, that would always give me an upper bound (for files <= 1M), as far as I can tell. But that is not a very tight bound; can you suggest anything better?

>> I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.
>
> Use delegation. Users can create their own datasets, set parameters, etc. For this case, you could consider changing recordsize, if you really are so worried about 1k. IMHO, it is easier and less expensive in process and pain to just buy more disk when needed.

Users of OpenAFS, not unprivileged users. All users I am talking about are the administrators for their machines. I would just like to reduce the number of filesystem-specific steps needed to be taken to set up the cache. You don't need to do anything special for a tmpfs cache, for instance, or ext2/3 caches on linux.

--
Andrew Deason
adea...@sinenomine.net
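As a rough illustration of that upper-bound idea (the 128k recordsize and the single spare record are assumptions, not an exact accounting of ZFS metadata):

  # round the logical size up to whole 128k records and add one spare record (ksh/bash)
  estimate_upper_bound() {
      size=$1
      rec=131072
      echo $(( ((size + rec - 1) / rec + 1) * rec ))
  }
  estimate_upper_bound 300000    # -> 524288: three 128k records for the data, plus one spare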
[zfs-discuss] addendum: zpool UNAVAIL even though disk is online: another label issue?
michael schuster wrote:
> All, this morning, I did pkg image-update from 118 to 123 (internal repo), and upon reboot all I got was the grub prompt - no menu, nothing.
> [snip - rest of the original message, including the format/partition output, quoted in full]

I just found out that this disk has been EFI-labelled, which I understand isn't what zfs likes/expects. What to do now?

TIA
Michael

--
Michael Schuster http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
[zfs-discuss] snv_XXX features / fixes - Solaris 10 version
Since most zfs features / fixes are reported in snv_XXX terms, is there some sort of way to figure out which versions of Solaris 10 have the equivalent features / fixes?

Thanks,
Chris
Re: [zfs-discuss] addendum: zpool UNAVAIL even though disk is online: another label issue?
Michael,

ZFS handles EFI labels just fine, but you need an SMI label on the disk that you are booting from.

Are you saying that localtank is your root pool? I believe the OSOL install creates a root pool called rpool. I don't remember if it's configurable.

Changing labels or partitions from beneath a live pool isn't supported and can cause data loss. Can you describe the changes other than the pkg image-update that lead up to this problem?

Cindy

On 09/18/09 11:05, michael schuster wrote:
> [snip - earlier message and format output quoted in full]
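To see which pools sit on which devices and what kind of label a disk carries, something along these lines works from the live CD; the device name is only an example:

  zpool import        # lists importable pools and the devices they live on
  format -e c8t0d0    # the label/partition menus show whether the disk has an SMI or EFI label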
[zfs-discuss] If you have ZFS in production, willing to share some details (with me)?
I am trying to compile some deployment scenarios of ZFS. If you are running ZFS in production, would you be willing to provide (publicly or privately)?

  # of systems
  amount of storage
  application profile(s)
  type of workload (low, high; random, sequential; read-only, read-write, write-only)
  storage type(s)
  industry
  whether it is private or I can share in a summary
  anything else that might be of interest

Thanks in advance!!

Steffen
Re: [zfs-discuss] addendum: zpool UNAVAIL even though disk is online: another label issue?
Cindy Swearingen wrote:
> ZFS handles EFI labels just fine, but you need an SMI label on the disk that you are booting from. Are you saying that localtank is your root pool?

no... (I was on the plane yesterday, I'm still jet-lagged), I should have realised that that's strange.

> I believe the OSOL install creates a root pool called rpool. I don't remember if it's configurable.

I didn't do anything to change that. This leads me to the assumption that the disk I should be looking at is actually c8t0d0, the other disk in the format output.

> Can you describe the changes other than the pkg-image-update that lead up to this problem?

0) pkg refresh; pkg install SUNWipkg
1) pkg image-update (creates opensolaris-119)
2) pkg mount opensolaris-119 /mnt
3) cat /mnt/etc/release (to verify I'd indeed installed b123)
4) pkg umount opensolaris-119
5) pkg rename opensolaris-119 opensolaris-123 # this failed, because it's active
6) pkg activate opensolaris-118 # so I can rename the new one
7) pkg rename ...
8) pkg activate opensolaris-123
9) reboot

thx
Michael

> [snip - remainder of the earlier thread, including the format output, quoted in full]
[zfs-discuss] Crazy Phantom Zpools Again
I just did a fresh reinstall of OpenSolaris and I'm again seeing the phenomenon described in http://article.gmane.org/gmane.os.solaris.opensolaris.zfs/26259 which I posted many months ago and got no reply to.

Can someone *please* help me figure out what's going on here?

Thanks in Advance,

--
Dave Abrahams
BoostPro Computing
http://boostpro.com
Re: [zfs-discuss] If you have ZFS in production, willing to share some details (with me)?
On 9/18/2009 1:51 PM, Steffen Weiberle wrote:
> I am trying to compile some deployment scenarios of ZFS.
> # of systems

do zfs root count? or only big pools?

> amount of storage

raw or after parity?

--
Jeremy Kister
http://jeremy.kister.net./
Re: [zfs-discuss] ZFS file disk usage
On Sep 18, 2009, at 7:36 AM, Andrew Deason wrote:
> On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski mi...@task.gda.pl wrote:
>> if you would create a dedicated dataset for your cache and set quota on it then instead of tracking a disk space usage for each file you could easily check how much disk space is being used in the dataset. Would it suffice for you?
>
> No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.

The transactional nature of ZFS may work against you here. Until the data is committed to disk, it is unclear how much space it will consume. Compression clouds the crystal ball further.

> I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.

Use delegation. Users can create their own datasets, set parameters, etc. For this case, you could consider changing recordsize, if you really are so worried about 1k. IMHO, it is easier and less expensive in process and pain to just buy more disk when needed.
-- richard
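A quick sketch of the delegation Richard mentions, with invented group and dataset names:

  # let members of group 'staff' create and tune datasets under tank/cache
  zfs allow -g staff create,mount,quota,recordsize tank/cache
  # an ordinary user in that group can then do, for example:
  zfs create -o recordsize=128k -o quota=1g tank/cache/afs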
Re: [zfs-discuss] addendum: zpool UNAVAIL even though disk is online: another label issue?
Cindy Swearingen wrote:
> Michael,
> Get some rest. :-) Then see if you can import your root pool while booted from the LiveCD.

that's what I tried - I'm never even shown rpool; I probably wouldn't have mentioned localtank at all if I had ;-)

> After you get to that point, you might search the indiana-discuss archive for tips on resolving the pkg-image-update no grub menu problem.

if I don't see rpool, that's not going to be the next step for me, right?

thx
Michael

> [snip - remainder of the earlier thread quoted in full]
[zfs-discuss] ZFS HW RAID
Hello folks,

I am sure this topic has been asked, but I am new to this list. I have read a ton of doc's on the web, but wanted to get some opinions from you all. Also, if someone has a digest of the last time this was discussed, you can just send that to me.

In any case, I am reading a lot of mixed reviews related to ZFS on HW RAID devices. The Sun docs seem to indicate it is possible, but not a recommended course. I realize there are some advantages, such as snapshots, etc. But the h/w raid will handle 'most' disk problems, basically undercutting the capabilities that are the big reasons to deploy zfs. One suggestion would be to create the h/w RAID LUNs as usual, present them to the OS, then do simple striping with ZFS. Here are my two applications, where I am presented with this possibility:

Sun Messaging Environment: We currently use EMC storage. The storage team manages all Enterprise storage. We currently have 10x300gb UFS mailstores presented to the OS. Each LUN is a HW RAID 5 device. We will be upgrading the application and doing a hardware refresh of this environment, which will give us the chance to move to ZFS, but stay on EMC storage. I am sure the storage team will not want to present us with JBOD. It is their practice to create the HW LUNs and present them to the application teams. I don't want to end up with a complicated scenario, but would like to leverage the most I can with ZFS, but on the EMC array as I mentioned.

Sun Directory Environment: The directory team is running HP DL385 G2, which also has a built-in HW RAID controller for 5 internal SAS disks. The team currently has DS5.2 deployed on RHEL3, but as we move to DS6.3.1, they may want to move to Solaris 10. We have an opportunity to move to ZFS in this environment, but am curious how to best leverage ZFS capabilities in this scenario.

JBOD is very clear, but a lot of manufacturers out there are still offering HW RAID technologies, with high-speed caches. Using ZFS with these is not very clear to me, and as I mentioned, there are very mixed reviews, not on ZFS features, but how it's used in HW RAID settings.

Thanks for any observations.

Lloyd
Re: [zfs-discuss] ZFS file disk usage
Andrew Deason wrote:
> On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski mi...@task.gda.pl wrote:
>> if you would create a dedicated dataset for your cache and set quota on it then instead of tracking a disk space usage for each file you could easily check how much disk space is being used in the dataset. Would it suffice for you?
>
> No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.

but having a dedicated dataset will let you answer such a question immediately, as then zfs tells you, for the dataset, how much space is used (everything: data + metadata) and how much is left.

> I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.

no, it is not.

>> Setting recordsize to 1k if you have lots of files (I assume) larger than that doesn't really make sense. The problem with metadata is that by default it is also compressed so there is no easy way to tell how much disk space it occupies for a specified file using standard API.
>
> We do not know in advance what file sizes we'll be seeing in general. We could of course tell people to tune the cache dataset according to their usage pattern, but I don't think users are generally going to know what their cache usage pattern looks like. I can say that at least right now, usually each file will be at most 1M long (1M is the max unless the user specifically changes it). But between the range 1k-1M, I don't know what the distribution looks like.

What I meant was that I believe the default recordsize of 128k should be fine for you (files smaller than 128k will use a smaller recordsize; larger ones will use a recordsize of 128k). The only problem will be with files truncated to 0 and growing again, as they will be stuck with an old recordsize. But in most cases it probably won't be a practical problem anyway.
Re: [zfs-discuss] ZFS HW RAID
On Fri, 18 Sep 2009, Lloyd H. Gill wrote:
> The Sun docs seem to indicate it possible, but not a recommended course. I realize there are some advantages, such as snapshots, etc. But, the h/w raid will handle most disk problems, basically reducing the great capabilities of the big reasons to deploy zfs. One suggestion would be to create the h/w RAID LUNs as usual, present them to the OS, then do simple striping with ZFS.

ZFS will catch issues that the H/W RAID will not. Other than this, there is nothing inherently wrong with the simple striping with ZFS as long as you are confident about your SAN device. If your SAN device fails, the whole ZFS pool may be lost, and if the failure is temporary, then the pool will be down until the SAN is restored. If you care to keep your pool up and alive as much as possible, then mirroring across SAN devices is recommended.

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
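A sketch of what that mirroring could look like, with invented device names where each mirror pairs one LUN from array A (c2...) with one from array B (c3...):

  zpool create mailpool mirror c2t0d0 c3t0d0 mirror c2t1d0 c3t1d0
  zpool status mailpool    # each top-level vdev now survives the loss of either array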
Re: [zfs-discuss] ZFS HW RAID
Hi, see comments inline:

Lloyd H. Gill wrote:
> [snip - original question about using ZFS on HW RAID devices, quoted in full above]

Of course you can use zfs on disk arrays with RAID done in HW, and you will still be able to use most ZFS features, including snapshots, clones, compression, etc. It is not recommended in the sense that unless ZFS has a pool in a redundant configuration from ZFS's point of view, it won't be able to heal corrupted blocks if they occur (it will still be able to detect them). Most other filesystems on the market won't even detect such a case, not to mention repair it, so if you are ok with not having this great zfs feature then go ahead. All the other features of zfs will work as expected.

Now, if you want to present several LUNs with RAID done in HW, then yes, the best approach usually is to add them all to a pool in a striped configuration. ZFS will always put 2 or 3 copies of metadata on different LUNs if possible, so you will end up with some protection (self-healing) from zfs - for metadata at least.

The other option (more expensive) is to do raid-10 or raid-z on top of LUNs which are already protected with some RAID level on a disk array. For example, if you presented 4 luns, each with RAID-5 done in HW, and then created a pool with 'zpool create test mirror lun1 lun2 mirror lun3 lun4', you would effectively end up with a RAID-50 configuration; it would of course halve the available logical storage, but would allow zfs to do self-healing.

> Sun Messaging Environment: We currently use EMC storage. [snip] I am sure the storage team will not want to present us with JBOD. It is their practice to create the HW LUNs and present them to the application teams.

just create a pool which would stripe across such luns.

> Sun Directory Environment: The directory team is running HP DL385 G2, which also has a built-in HW RAID controller for 5 internal SAS disks. [snip] JBOD is very clear, but a lot of manufacturers out there are still offering HW RAID technologies, with high-speed caches.
> Using ZFS with these is not very clear to me, and as I mentioned, there are very mixed reviews, not on ZFS features, but how it's used in HW RAID settings.

Here you have three options.

One is RAID in HW with one LUN, and then just create a pool on top of it. ZFS will be able to detect a corruption if it happens but won't be able to fix it (at least not for data).

Another option is to present each disk as a RAID-0 LUN and then do RAID-10 or RAID-Z in ZFS. Most RAID controllers will still use their cache in such a configuration, so you would still benefit from it, and ZFS will be able to detect and fix corruption if it happens. However, the procedure for replacing a failed disk drive could be more complicated, or even require a downtime, depending on the controller and whether there is a management tool on solaris for it (otherwise, if a disk dies, with many pci controllers with one disk per raid-0 LUN you will have to go into its bios and re-create the failed disk with a new one). But check your controller; maybe it is not an issue for you, or maybe it is even an acceptable approach.

The last option would be to disable the RAID controller and access the disks directly, doing raid in zfs. That way you lose your cache, of course. If your applications are sensitive to write latency to your ldap database, then going with one of the first two options could actually prove to be a faster solution (assuming the volume of writes is not so big that the cache is 100% utilized all the time, as then it is down to disks).

Another thing
Re: [zfs-discuss] snv_XXX features / fixes - Solaris 10 version
Richard Elling wrote:
> On Sep 18, 2009, at 10:06 AM, Chris Banal wrote:
>> Since most zfs features / fixes are reported in snv_XXX terms. Is there some sort of way to figure out which versions of Solaris 10 have the equivalent features / fixes?
>
> There is no automated nor easy way to do this. Not all features are backported to Solaris 10. The best you can hope for is that the CRs are mentioned in Solaris 10 patches. Since the contents of many Solaris 10 CRs are not publicly available, this becomes a form of a guessing game. My suggestion: get a subscription for OpenSolaris.
> -- richard

In many cases you can look thru the changelog of a given build (ON) at opensolaris.org and then look for a bug id on sunsolve in s10 kernel patches to see if it is mentioned in any of them.
Re: [zfs-discuss] RAIDZ versus mirrored
On Wed, 2009-09-16 at 14:19 -0700, Richard Elling wrote:
> Actually, I had a ton of data on resilvering which shows mirrors and raidz equivalently bottlenecked on the media write bandwidth. However, there are other cases which are IOPS bound (or CR bound :-) which cover some of the postings here. I think Sommerfeld has some other data which could be pertinent.

I'm not sure I have data, but I have anecdotes and observations, and a few large production pools used for solaris development by me and my coworkers. the biggest one (by disk count) takes 80-100 hours to scrub and/or resilver.

my working hypothesis is that resilver of pools which:

1) have a lot of files, directories, filesystems, and periodic snapshots
2) have atime updates enabled (default config)
3) have regular (daily) jobs doing large-scale filesystem tree-walks

wind up rewriting most blocks of the dnode files on every tree walk doing atime updates, and as a result the dnode file (but not most of the blocks it points to) differs greatly from daily snapshot to daily snapshot. as a result, scrub/resilver traversals end up spending most of their time doing random reads of the dnode files of each snapshot.

here are some bugs that, if fixed, might help:

6678033 resilver code should prefetch
6730737 investigate colocating directory dnodes

- Bill
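One mitigation implied by point 2 would be simply turning off atime updates on the datasets the tree-walk jobs traverse, so the walks stop dirtying the dnode files; the dataset name below is only an example:

  zfs set atime=off tank/builds
  zfs get atime tank/builds    # verify the setting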
Re: [zfs-discuss] ZFS file disk usage
On Fri, 18 Sep 2009 16:38:28 -0400 Robert Milkowski mi...@task.gda.pl wrote:
>> No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.
>
> but having a dedicated dataset will let you answer such a question immediatelly as then you get from zfs information from for the dataset on how much space is used (everything: data + metadata) and how much is left.

Immediately? There isn't a delay between the write and the next commit when the space is recorded? (Do you mean a statvfs equivalent, or some zfs-specific call?) And the current code is structured such that we record usage changes before a write; it would be a huge pain to rely on the write to calculate the usage (for that and other reasons).

>> [snip - discussion of recordsize and file-size distribution]
>
> What I meant was that I believe that default recordsize of 128k should be fine for you (files smaller than 128k will use smaller recordsize, larger ones will use a recordsize of 128k). The only problem will be with files truncated to 0 and growing again as they will be stuck with an old recordsize. But in most cases it won't probably be a practical problem anyway.

Well, it may or may not be 'fine'; we may have a lot of little files in the cache, and rounding up to 128k for each one reduces our disk efficiency somewhat. Files are truncated to 0 and grow again quite often in busy clients. But that's an efficiency issue; we'd still be able to stay within the configured limit that way.

But anyway, 128k may be fine for me, but what about if someone sets their recordsize to something different? That's why I was wondering about the overhead if someone sets the recordsize to 1k; is there no way to account for it even if I know the recordsize is 1k?

--
Andrew Deason
adea...@sinenomine.net
Re: [zfs-discuss] ZFS HW RAID
Lloyd H. Gill wrote:
> Hello folks, [snip]

Comments below from me, as I am a user of both of these environments, both with ZFS.

You may also want to check the iMS archives or subscribe to the list. This is where all the Sun Messaging Server gurus hang out. (I listen mostly ;)) The list is: info-...@arnold.com and you can get more info here: http://mail.arnold.com/info-ims.htmlx

> Sun Messaging Environment: We currently use EMC storage. [snip] I am sure the storage team will not want to present us with JBOD. It is their practice to create the HW LUNs and present them to the application teams. I don't want to end up with a complicated scenario, but would like to leverage the most I can with ZFS, but on the EMC array as I mentioned.

In this environment I do what Bob mentioned in his reply to you: I provision two LUNs for each data volume and mirror them with ZFS. The LUNs are based on RAID 5 stripes on 3510's, 3511's and 6140's. Mirroring them with ZFS gives all of the niceties of ZFS, and it will catch any of the silent data corruption type issues that hardware RAID will not. My reasons for doing it this way go back to Disksuite days as well (which I no longer use; ZFS or nothing pretty much these days).

My setup is based on 5 x 250 GB mirrored pairs with around 3-4 million messages per volume. The two LUNs I mirror are *always* provisioned from two separate arrays in different data centers. This also means that in the case of a massive catastrophe at one data centre, I should have a good copy from the 'mirror of last resort' that I can get our business back up and running on quickly.

Another advantage is that it allows for relatively easy array maintenance and upgrades. ZFS only remirrors changed blocks rather than doing a complete block re-sync like disksuite does. This allows for very fast convergence times in the likes of file servers where change is relatively light, albeit continuous. Mirrors here are super quick to re-converge from my experience, a little quicker than RAIDZ's. (I don't have data to back this up, just a casual observation.)

In some respects, being both a storage guy and a systems guy: sometimes the storage people need to get with the program a bit. :P If you use ZFS with one of its redundant forms (mirrors or RAIDZ's) then JBOD presentation will be fine.

> Sun Directory Environment: The directory team is running HP DL385 G2, which also has a built-in HW RAID controller for 5 internal SAS disks.
> The team currently has DS5.2 deployed on RHEL3, but as we move to DS6.3.1, they may want to move to Solaris 10. We have an opportunity to move to ZFS in this environment, but am curious how to best leverage ZFS capabilities in this scenario. JBOD is very clear, but a lot of manufacturers out there are still offering HW RAID technologies, with high-speed caches. Using ZFS with these is not very clear to me, and as I mentioned, there are very mixed reviews, not on ZFS features, but how it's used in HW RAID settings.

A Sun Directory environment generally isn't very IO intensive, except for massive data reloads or indexing operations. Other than that it is an ideal candidate for ZFS and its rather nice ARC cache. Memory is cheap on a lot of boxes and it will make read-only type file systems fly. I imagine your actual living LDAP data set on disk probably won't be larger than 10 Gigs or so? I have around 400K objects in mine and it's only about 2 Gigs or so, including all our indexes. I tend to tune DS up so that everything it needs is in RAM anyway.

As far as directory server goes, are you using the 64-bit version on Linux? If not, you should be.

> Thanks for any observations.
> Lloyd
Re: [zfs-discuss] ZFS file disk usage
Andrew Deason wrote:
> On Fri, 18 Sep 2009 16:38:28 -0400 Robert Milkowski mi...@task.gda.pl wrote:
>>> No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.
>>
>> but having a dedicated dataset will let you answer such a question immediatelly as then you get from zfs information from for the dataset on how much space is used (everything: data + metadata) and how much is left.
>
> Immediately? There isn't a delay between the write and the next commit when the space is recorded? (Do you mean a statvfs equivalent, or some zfs-specific call?) And the current code is structured such that we record usage changes before a write; it would be a huge pain to rely on the write to calculate the usage (for that and other reasons).

There will be a delay of up to 30s currently. But how much data do you expect to be pushed within 30s? Let's say it were even 10g in lots of small files and you calculated the total size by only summing up the logical size of the data. Would you really expect the error to be greater than 5%, which would be 500mb? Does it matter in practice?

> [snip - recordsize discussion quoted in full]
>
> Well, it may or may not be 'fine'; we may have a lot of little files in the cache, and rounding up to 128k for each one reduces our disk efficiency somewhat. Files are truncated to 0 and grow again quite often in busy clients. But that's an efficiency issue, we'd still be able to stay within the configured limit that way. But anyway, 128k may be fine for me, but what about if someone sets their recordsize to something different? That's why I was wondering about the overhead if someone sets the recordsize to 1k; is there no way to account for it even if I know the recordsize is 1k?

What if a user enables compression like lzjb or even gzip? How would you like to take that into account before doing writes? What if a user creates a snapshot? How would you take that into account?

I suspect that you are looking too closely for no real benefit. Especially if you don't want to dedicate a dataset to the cache, you would expect other applications in the system to write to the same file system but different locations, over which you have no control and no ability to predict how much data will be written at all. Be it Linux, Solaris, BSD, ... the issue will be there.
IMHO a dedicated dataset and statvfs() on it should be good enough, possibly with an estimate before writing your data (as a total logical file size from the application's point of view) - however, due to compression or dedup enabled by the user, that estimate could be totally wrong, so it probably doesn't actually make sense.

--
Robert Milkowski
http://milek.blogspot.com
Re: [zfs-discuss] ZFS HW RAID
Scott Lawson wrote:
> Sun Directory environment generally isn't very IO intensive, except for in massive data reloads or indexing operations. Other than this it is an ideal candidate for ZFS and it's rather nice ARC cache. Memory is cheap on a lot of boxes and it will make read only type file systems fly. [snip]

From my experience, enabling lzjb compression for DS makes it even faster and reduces disk usage by about 2x.

--
Robert Milkowski
http://milek.blogspot.com
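For what it's worth, the knob in question would be set along these lines (the dataset name holding the DS database is invented):

  zfs set compression=lzjb tank/ds/db
  zfs get compression,compressratio tank/ds/db    # compressratio shows what the instance actually achieves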
Re: [zfs-discuss] Crazy Phantom Zpools Again
Dave,

I've searched opensolaris.org and our internal bug database. I don't see that anyone else has reported this problem. I asked someone from the OSOL install team and this behavior is a mystery. If you destroyed the phantom pools before you reinstalled, then they probably returned from the import operations, but I can't be sure. If you want to export your tank pool and re-import it, then maybe you should just use zpool import tank until the root cause of the phantom pools is determined.

Not much help, but some ideas:

1. What does the zpool history -l output say for the phantom pools? Were they created at the same time as the root pool or the same time as tank?

2. The phantom pools contain the c8t1* and c9t1* fdisk partitions (p0s) that are in your tank pool as whole disks. A strange coincidence. Does zdb output or fmdump output identify the relationship, if any, between the c8 and c9 devices in the phantom pools and tank?

3. I can file a bug for you. Please provide the system information, such as hardware, disks, OS release.

Cindy

On 09/18/09 12:18, Dave Abrahams wrote:
> I just did a fresh reinstall of OpenSolaris and I'm again seeing the phenomenon described in http://article.gmane.org/gmane.os.solaris.opensolaris.zfs/26259 which I posted many months ago and got no reply to. Can someone *please* help me figure out what's going on here?
> Thanks in Advance,
> --
> Dave Abrahams
> BoostPro Computing
> http://boostpro.com
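The sort of commands that would collect what Cindy is asking for look roughly like this; the pool name is invented and the device path is only an example based on the thread:

  zpool history -l phantom1        # long format shows who/when/where each command was run
  zdb -l /dev/dsk/c8t1d0p0         # dump the ZFS labels found on the suspect fdisk partition
  fmdump -eV                       # recent FMA error telemetry, if any, for the devices involved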
Re: [zfs-discuss] ZFS HW RAID
On Sep 18, 2009, at 16:52, Bob Friesenhahn wrote:
> If you care to keep your pool up and alive as much as possible, then mirroring across SAN devices is recommended.

One suggestion I heard was to get a LUN that's twice the size, and set copies=2. This way you have some redundancy for incorrect checksums. Haven't done it myself.
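The setting in question, on a hypothetical dataset carved out of the double-sized LUN:

  zfs set copies=2 sanpool/data
  zfs get copies sanpool/data    # each data block is now stored twice within the single LUN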
Re: [zfs-discuss] ZFS HW RAID
On Fri, 18 Sep 2009, David Magda wrote:
>> If you care to keep your pool up and alive as much as possible, then mirroring across SAN devices is recommended.
>
> One suggestion I heard was to get a LUN that's twice the size, and set copies=2. This way you have some redundancy for incorrect checksums.

This only helps for block-level corruption. It does not help much at all if a whole LUN goes away. It seems best for single-disk rpools.

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/