Re: [zfs-discuss] [zfs] Re: how to know available disk space, 22% free space missing
Hello, Any comments/suggestions about this would be very nice.. Thanks! -- Pasi

On Fri, Feb 08, 2013 at 05:09:56PM +0200, Pasi Kärkkäinen wrote: I'm seeing weird output as well:

# zpool list foo
NAME   SIZE  ALLOC   FREE  CAP  DEDUP   HEALTH  ALTROOT
foo   5.44T  4.44T  1023G  81%  14.49x  ONLINE  -

# zfs list | grep foo
foo                62.9T      0   250G  /volumes/foo
foo/.nza-reserve     31K   100M    31K  none
foo/foo            62.6T      0  62.6T  /volumes/foo/foo

# zfs list -o space foo
NAME  AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
foo       0  62.9T         0    250G              0      62.7T

# zfs list -o space foo/foo
NAME     AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
foo/foo      0  62.6T         0   62.6T              0          0

What's the correct way of finding out what actually uses/reserves that 1023G of FREE in the zpool? At this point the filesystems are full, and it's not possible to write to them anymore. Also, creating new filesystems in the pool fails: Operation completed with error: cannot create 'foo/Test': out of space. So the zpool is full for real. I'd like to better understand what actually uses that 1023G of FREE space reported by zpool.. 1023G out of 4.32T is around 22% overhead.. zpool foo consists of 3x mirror vdevs, so there's no raidz involved. 62.6T / 14.49x dedup ratio = 4.32T, which is pretty close to the ALLOC value reported by zpool.. Data on the filesystem is VM images written over NFS. Thanks, -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] how to know available disk space
On Wed, Feb 06, 2013 at 08:03:13PM -0700, Jan Owoc wrote: On Wed, Feb 6, 2013 at 4:26 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote: When I used zpool status after the system crashed, I saw this:

NAME      SIZE  ALLOC   FREE  EXPANDSZ  CAP  DEDUP  HEALTH  ALTROOT
storage   928G   568G   360G         -  61%  1.00x  ONLINE  -

I did some cleanup, so I could turn things back on ... Freed up about 4G. Now, when I use zpool status I see this:

NAME      SIZE  ALLOC   FREE  EXPANDSZ  CAP  DEDUP  HEALTH  ALTROOT
storage   928G   564G   364G         -  60%  1.00x  ONLINE  -

When I use zfs list storage I see this:

NAME      USED  AVAIL  REFER  MOUNTPOINT
storage   909G  4.01G  32.5K  /storage

So I guess the lesson is (a) refreservation and zvol alone aren't enough to ensure your VMs will stay up, and (b) if you want to know how much room is *actually* available, as in usable, as in, how much can I write before I run out of space, you should use zfs list and not zpool status. Could you run zfs list -o space storage? It will show how much is used by the data, the snapshots, refreservation, and children (if any). I read somewhere that one should always use zfs list to determine how much space is actually available to be written on a given filesystem. I have an idea, but it's a long shot. If you created more than one zfs on that pool, and added a reservation to each one, then that space is still technically unallocated as far as zpool list is concerned, but is not available for writing when you do zfs list. I would imagine you have one or more of your VMs that grew outside of their refreservation and have now crashed for lack of free space on their zfs. Some of the other VMs aren't using their refreservation (yet), so they could, between them, still write 360GB of stuff to the drive.

I'm seeing weird output as well:

# zpool list foo
NAME   SIZE  ALLOC   FREE  CAP  DEDUP   HEALTH  ALTROOT
foo   5.44T  4.44T  1023G  81%  14.49x  ONLINE  -

# zfs list | grep foo
foo                62.9T      0   250G  /volumes/foo
foo/.nza-reserve     31K   100M    31K  none
foo/foo            62.6T      0  62.6T  /volumes/foo/foo

# zfs list -o space foo
NAME  AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
foo       0  62.9T         0    250G              0      62.7T

# zfs list -o space foo/foo
NAME     AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
foo/foo      0  62.6T         0   62.6T              0          0

What's the correct way of finding out what actually uses/reserves that 1023G of FREE in the zpool? At this point the filesystems are full, and it's not possible to write to them anymore. Also, creating new filesystems in the pool fails: Operation completed with error: cannot create 'foo/Test': out of space. So the zpool is full for real. I'd like to better understand what actually uses that 1023G of FREE space reported by zpool.. 1023G out of 4.32T is around 22% overhead.. zpool foo consists of 3x mirror vdevs, so there's no raidz involved. 62.6T / 14.49x dedup ratio = 4.32T, which is pretty close to the ALLOC value reported by zpool.. Data on the filesystem is VM images written over NFS. Thanks, -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] how to know available disk space
On Fri, Feb 08, 2013 at 09:47:38PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: Pasi Kärkkäinen [mailto:pa...@iki.fi] What's the correct way of finding out what actually uses/reserves that 1023G of FREE in the zpool? Maybe this isn't exactly what you need, but maybe:

for fs in `zfs list -H -o name` ; do echo $fs ; zfs get reservation,refreservation,usedbyrefreservation $fs ; done

I checked this and there are no reservations configured (or well, there are the 100MB defaults, but not more than that). So reservations don't explain this.. At this point the filesystems are full, and it's not possible to write to them anymore. You'll have to either reduce your reservations, or destroy old snapshots. Or add more disks. There aren't any snapshots either.. I know adding disks will fix the problem, but I'd like to understand why zpool says there is almost 1TB of FREE space when clearly there isn't.. Thanks for the reply! -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
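For reference, a recursive form of the same reservation check (a sketch, using the pool name from this thread) avoids the shell loop:

# zfs get -r reservation,refreservation,usedbyrefreservation foo

zfs get -r walks the pool and all of its descendant datasets in one command.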
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On Sun, Jan 20, 2013 at 07:51:15PM -0800, Richard Elling wrote: 2. VAAI support. VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product, but the CEO made a conscious (and unpopular) decision to keep that code from the community. Over the summer, another developer picked up the work in the community, but I've lost track of the progress and haven't seen an RTI yet. I assume SCSI UNMAP is implemented in Comstar in NexentaStor? Isn't Comstar CDDL licensed? There's also this: https://www.illumos.org/issues/701 .. which says UNMAP support was added to Illumos Comstar 2 years ago. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?
On Tue, Jan 08, 2013 at 06:36:18AM -0500, Ray Arachelian wrote: On 01/07/2013 04:16 PM, Sašo Kiselkov wrote: PERC H200 are well behaved cards that are easy to reflash and work well (even in JBOD mode) on Illumos - they are essentially an LSI SAS 9211. If you can get them, they're one heck of a reliable beast, and cheap too! I've had trouble with one of those (Dell PERC H200) in a Z68X-UD3H-B3 motherboard. When it was inserted in any slot, the machine wouldn't power on. I put it in a Dell desktop I borrowed for a day and it worked there. Any idea as to what might be the trouble? Couldn't even get it working long enough to attempt to reflash its BIOS. The machine would power on for a few seconds and immediately turn off. Wild guess: not enough available PCI option ROM memory for the H200 card on that motherboard? -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
On Tue, Nov 27, 2012 at 08:52:06AM +0100, Grégory Giannoni wrote: The LSI 9240-4i was not able to connect to the 25-drive bay; I have not tested the LSI 9260-16i or LSI 9280-24i. What was the problem connecting the LSI 9240-4i to the 25-drive bay? -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
On Thu, Nov 29, 2012 at 09:42:21AM +0100, Grégory Giannoni wrote: On 29 Nov. 2012, at 09:27, Pasi Kärkkäinen wrote: The LSI 9240-4i was not able to connect to the 25-drive bay; I have not tested the LSI 9260-16i or LSI 9280-24i. What was the problem connecting the LSI 9240-4i to the 25-drive bay? The 25-drive backplane needs two SFF-8087 (multilane) cables to work correctly. The LSI 9240-4i has just one SFF-8087 port. Yeah, that explains it :) -- Pasi Using 2 LSI 9240-4i cards didn't work either. -- Grégory Giannoni http://www.wmaker.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] all in one server
On Tue, Sep 18, 2012 at 05:30:56PM +0200, Erik Ableson wrote: If you're running ESXi with a vSphere license, I'd recommend looking at VDR (free with the vCenter license) for backing up the VMs to the little HPs since you get compressed and deduplicated backups that will minimize the replication bandwidth requirements. Don't look at VDR. It's known to be very buggy and to corrupt itself in no time. It's also known to do bad restores that overwrite the *wrong* VMs. VMware has also killed it and replaced it with another product. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Migrating 512 byte block zfs root pool to 4k disks
On Fri, Jun 15, 2012 at 06:23:42PM -0500, Timothy Coalson wrote: Sorry, if you meant distinguishing between true 512 and emulated 512/4k, I don't know, it may be vendor-specific as to whether they expose it through device commands at all. At least on Linux you can see the info from:

/sys/block/<disk>/queue/logical_block_size=512
/sys/block/<disk>/queue/physical_block_size=4096

-- Pasi Tim On Fri, Jun 15, 2012 at 6:02 PM, Timothy Coalson tsc...@mst.edu wrote: On Fri, Jun 15, 2012 at 5:35 PM, Jim Klimov jimkli...@cos.ru wrote: 2012-06-16 0:05, John Martin wrote: It's important to know... ...whether the drive is really 4096p or 512e/4096p. BTW, is there a surefire way to learn that programmatically from Solaris or its derivatives? prtvtoc <device> should show the block size the OS thinks it has. Or you can use format, select the disk from a list that includes the model number and size, and use verify. Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Migrating 512 byte block zfs root pool to 4k disks
On Wed, Jun 27, 2012 at 01:42:27AM +0300, Pasi Kärkkäinen wrote: On Fri, Jun 15, 2012 at 06:23:42PM -0500, Timothy Coalson wrote: Sorry, if you meant distinguishing between true 512 and emulated 512/4k, I don't know, it may be vendor-specific as to whether they expose it through device commands at all. At least on Linux you can see the info from:

/sys/block/<disk>/queue/logical_block_size=512
/sys/block/<disk>/queue/physical_block_size=4096

Oh, and also these methods work on Linux:

# hdparm -I /dev/sdc | grep Sector
        Logical  Sector size:            512 bytes
        Physical Sector size:           4096 bytes
        Logical Sector-0 offset:         512 bytes

And then there's the BLKPBSZGET ioctl. So I'd be surprised if that stuff isn't implemented on *solaris.. -- Pasi Tim On Fri, Jun 15, 2012 at 6:02 PM, Timothy Coalson tsc...@mst.edu wrote: On Fri, Jun 15, 2012 at 5:35 PM, Jim Klimov jimkli...@cos.ru wrote: 2012-06-16 0:05, John Martin wrote: It's important to know... ...whether the drive is really 4096p or 512e/4096p. BTW, is there a surefire way to learn that programmatically from Solaris or its derivatives? prtvtoc <device> should show the block size the OS thinks it has. Or you can use format, select the disk from a list that includes the model number and size, and use verify. Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
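On the Linux side, the same information can also be pulled out with blockdev from util-linux. A sketch only; /dev/sdc is just an example device:

# blockdev --getss /dev/sdc
512
# blockdev --getpbsz /dev/sdc
4096

--getss reports the logical sector size and --getpbsz the physical block size, which is essentially the BLKPBSZGET ioctl mentioned above (plus BLKSSZGET for the logical size) wrapped in a command.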
Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sun, Jan 08, 2012 at 06:59:57AM +0400, Jim Klimov wrote: 2012-01-08 5:37, Richard Elling wrote: The big question is whether they are worth the effort. Spares solve a serviceability problem and only impact availability in an indirect manner. For single-parity solutions, spares can make a big difference in MTTDL, but have almost no impact on MTTDL for double-parity solutions (eg. raidz2). Well, regarding this part: in the presentation linked in my OP, the IBM presenter suggests that for a 6-disk raid10 (3 mirrors) with one spare drive, overall a 7-disk set, there are such options for critical hits to data redundancy when one of the drives dies: 1) Traditional RAID - one full disk is a mirror of another full disk; 100% of a disk's size is critical and has to be replicated onto a spare drive ASAP; 2) Declustered RAID - all 7 disks are used for 2 unique data blocks from the original setup and one spare block (I am not sure I described it well in words, his diagram shows it better); if a single disk dies, only 1/7 worth of disk size is critical (not redundant) and can be fixed faster. For their typical 47-disk sets of RAID-7-like redundancy, under 1% of data becomes critical when 3 disks die at once, which is (deemed) unlikely as is. Apparently, in the GPFS layout, MTTDL is much higher than in raid10+spare with all other stats being similar. I am not sure I'm ready (or qualified) to sit down and present the math right now - I just heard some ideas that I considered worth sharing and discussing ;) Thanks for the video link (http://www.youtube.com/watch?v=2g5rx4gP6yU). It's very interesting! GPFS Native RAID seems to be more advanced than current ZFS, and it even has rebalancing implemented (the infamous missing zfs bp-rewrite). It'd definitely be interesting to have something like this implemented in ZFS. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] weird bug with Seagate 3TB USB3 drive
On Sat, Nov 12, 2011 at 10:08:04AM -0800, Richard Elling wrote: On Nov 12, 2011, at 8:31 AM, Pasi Kärkkäinen wrote: On Sat, Nov 12, 2011 at 08:15:31AM -0500, David Magda wrote: On Nov 12, 2011, at 00:55, Richard Elling wrote: Better than ? If the disks advertise 512 bytes, the only way around it is with a whitelist. I would be rather surprised if Oracle sells 4KB sector disks for Solaris systems? Solaris 10. OpenSolaris. But would it be surprising to use SANs with Solaris? Or perhaps run Solaris under some kind of virtualized environment where the virtual disk has a particular block size? Or maybe SSDs, which tend to read/write/delete in certain block sizes? In these situations simply assuming 512 may slow things down. And if Solaris 11 is going to be around for a decade or so, I'd hazard to guess that 512B sector disks will become less and less prevalent as time goes on. Might as well enable the functionality now, when 4K is rarer, so you have more time to test and tune things out, rather than later when you can potentially be left scrambling. As Pasi Kärkkäinen mentions, there's not much you can do if the disks lie (just as has been seen with disks that lie about flushing the cache). This is mostly a temporary kludge for legacy's sake. More and more disks will be truthful as time goes on. Most 4kB/sector disks already today properly report both the physical (4kB) and logical (512b) sector sizes. It sounds like *solaris is only checking the logical (512b) sector size, not the physical (4kB) sector size.. ZFS uses the physical block size. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev_disk.c#294 Hmm.. so everything should just work? Does some other part of the code use logical block size then, for example to calculate the ashift? Maybe I should read the code :) -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
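One quick way to see what ashift a pool actually ended up with, regardless of what the disks advertised, is to dump the cached pool configuration with zdb. A sketch only; 'foo' is a placeholder pool name and the exact indentation varies:

# zdb -C foo | grep ashift
            ashift: 9

ashift: 9 means 512-byte allocation units, ashift: 12 means 4 kB.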
Re: [zfs-discuss] weird bug with Seagate 3TB USB3 drive
On Fri, Nov 11, 2011 at 09:55:29PM -0800, Richard Elling wrote: On Nov 10, 2011, at 7:47 PM, David Magda wrote: On Nov 10, 2011, at 18:41, Daniel Carosone wrote: On Tue, Oct 11, 2011 at 08:17:55PM -0400, John D Groenveld wrote: Under both Solaris 10 and Solaris 11x, I receive the evil message: | I/O request is not aligned with 4096 disk sector size. | It is handled through Read Modify Write but the performance is very low. I got similar with 4k sector 'disks' (as a comstar target with blk=4096) when trying to use them to force a pool to ashift=12. The labels are found at the wrong offset when the block numbers change, and maybe the GPT label has issues too. Anyone know if Solaris 11 has better support for detecting the native block size of the underlying storage? Better than ? If the disks advertise 512 bytes, the only way around it is with a whitelist. I would be rather surprised if Oracle sells 4KB sector disks for Solaris systems? Afaik the disks advertise both the physical and logical sector size.. at least on Linux you can see that the disk emulates 512 bytes/sector, but natively it uses 4kB/sector:

/sys/block/<disk>/queue/logical_block_size=512
/sys/block/<disk>/queue/physical_block_size=4096

The info should be available through the IDENTIFY DEVICE (ATA) or READ CAPACITY 16 (SCSI) commands. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] weird bug with Seagate 3TB USB3 drive
On Sat, Nov 12, 2011 at 08:15:31AM -0500, David Magda wrote: On Nov 12, 2011, at 00:55, Richard Elling wrote: Better than ? If the disks advertise 512 bytes, the only way around it is with a whitelist. I would be rather surprised if Oracle sells 4KB sector disks for Solaris systems? Solaris 10. OpenSolaris. But would it be surprising to use SANs with Solaris? Or perhaps run Solaris under some kind of virtualized environment where the virtual disk has a particular block size? Or maybe SSDs, which tend to read/write/delete in certain block sizes? In these situations simply assuming 512 may slow things down. And if Solaris 11 is going to be around for a decade or so, I'd hazard to guess that 512B sector disks will become less and less prevalent as time goes on. Might as well enable the functionality now, when 4K is rarer, so you have more time to test and tune things out, rather than later when you can potentially be left scrambling. As Pasi Kärkkäinen mentions, there's not much you can do if the disks lie (just as has been seen with disks that lie about flushing the cache). This is mostly a temporary kludge for legacy's sake. More and more disks will be truthful as time goes on. Most 4kB/sector disks already today properly report both the physical (4kB) and logical (512b) sector sizes. It sounds like *solaris is only checking the logical (512b) sector size, not the physical (4kB) sector size.. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [OpenIndiana-discuss] Question about WD drives with Super Micro systems
On Sat, Aug 06, 2011 at 07:45:31PM +0200, Roy Sigurd Karlsbakk wrote: Might this be the SATA drives taking too long to reallocate bad sectors? This is a common problem desktop drives have: they will stop and basically focus on reallocating the bad sector as long as it takes, which causes the raid setup to time out the operation and flag the drive as failed. The enterprise SATA drives are typically the same as the high performing desktop drives, only they have a short timeout on how long they are allowed to try and reallocate a bad sector, so they don't hit the failed drive timeout. Some drive firmwares, such as older WD Blacks if memory serves, had the ability to be forced to behave like the enterprise drive, but WD updated the firmware so this is no longer possible. This is why you see SATA drives that typically have almost identical specs, but one will be $69 and the other $139 - the former is a desktop model while the latter is an enterprise or raid specific model. I believe it's called different things by different brands: TLER, ERC, and CCTL (?). I doubt this is about the lack of TLER et al. Some, or most, of the drives ditched by ZFS have shown to be quite good indeed. I guess this is a WD vs Intel SAS expander issue. What exact chassis / backplane / SAS expander is that (with the Intel SAS expander)? -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Question about drive LEDs
On Sat, Jun 18, 2011 at 09:49:44PM +0200, Roy Sigurd Karlsbakk wrote: Hi all, I have a few machines set up with OI 148, and I can't make the LEDs on the drives work when something goes bad. The chassis are Supermicro ones, and work well, normally. Any idea how to make drive LEDs work with this setup? Some questions:
- So the Supermicro chassis has SES support?
- Are you able to see which disk is in which chassis slot, from the info provided by SES?
- Are you able to control the LEDs manually through SES?
- Did you configure FMA in any way?
-- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
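If the enclosures hang off an LSI SAS2 HBA, one way to test the slot LEDs by hand, independent of FMA, is LSI's sas2ircu utility. A sketch only, assuming controller 0 and enclosure 2, slot 5; the numbers come from the display output and will differ on your system:

# sas2ircu 0 display
# sas2ircu 0 locate 2:5 ON
# sas2ircu 0 locate 2:5 OFF

If the locate LED can be driven manually this way, the backplane/SES side works and the problem is more likely in the FMA/disk-monitoring configuration.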
Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Sat, Jun 11, 2011 at 08:26:34PM +0400, Jim Klimov wrote: 2011-06-11 19:15, Pasi Kärkkäinen wrote: On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote: I've had two incidents where performance tanked suddenly, leaving the VM guests and Nexenta SSH/Web consoles inaccessible and requiring a full reboot of the array to restore functionality. In both cases, it was the Intel X-25M L2ARC SSD that failed or was offlined. NexentaStor failed to alert me on the cache failure, however the general ZFS FMA alert was visible on the (unresponsive) console screen. The zpool status output showed: cache c6t5001517959467B45d0 FAULTED 2 542 0 too many errors This did not trigger any alerts from within Nexenta. I was under the impression that an L2ARC failure would not impact the system. But in this case, it was the culprit. I've never seen any recommendations to RAID L2ARC for resiliency. Removing the bad SSD entirely from the server got me back running, but I'm concerned about the impact of the device failure and the lack of notification from NexentaStor. IIRC recently there was discussion on this list about a firmware bug on the Intel X25 SSDs causing them to fail under high disk IO with reset storms. Even if so, this does not forgive ZFS hanging - especially if it detected the drive failure, and especially if this drive is not required for redundant operation. I've seen similar bad behaviour on my oi_148a box when I tested USB flash devices as L2ARC caches and occasionally they died by slightly moving out of the USB socket due to vibration or whatever reason ;) Similarly, this oi_148a box hung upon loss of the SATA connection to a drive in the raidz2 disk set due to unreliable cable connectors, while it should have stalled IOs to that pool but otherwise the system should have remained responsive (tested failmode=continue and failmode=wait on different occasions). So I can relate - these things happen, they do annoy, and I hope they will be fixed sometime soon so that ZFS matches its docs and promises ;) True, definitely sounds like a bug in ZFS as well.. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations
On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote: Posted in greater detail at Server Fault - http://serverfault.com/q/277966/13325 I have an HP ProLiant DL380 G7 system running NexentaStor. The server has 36GB RAM, 2 LSI 9211-8i SAS controllers (no SAS expanders), 2 SAS system drives, 12 SAS data drives, a hot-spare disk, an Intel X25-M L2ARC cache and a DDRdrive PCI ZIL accelerator. This system serves NFS to multiple VMWare hosts. I also have about 90-100GB of deduplicated data on the array. I've had two incidents where performance tanked suddenly, leaving the VM guests and Nexenta SSH/Web consoles inaccessible and requiring a full reboot of the array to restore functionality. In both cases, it was the Intel X-25M L2ARC SSD that failed or was offlined. NexentaStor failed to alert me on the cache failure, however the general ZFS FMA alert was visible on the (unresponsive) console screen. The zpool status output showed: cache c6t5001517959467B45d0 FAULTED 2 542 0 too many errors This did not trigger any alerts from within Nexenta. I was under the impression that an L2ARC failure would not impact the system. But in this case, it was the culprit. I've never seen any recommendations to RAID L2ARC for resiliency. Removing the bad SSD entirely from the server got me back running, but I'm concerned about the impact of the device failure and the lack of notification from NexentaStor. What's the current best-choice SSD for L2ARC cache applications these days? It seems as though the Intel units are no longer well-regarded. IIRC recently there was discussion on this list about a firmware bug on the Intel X25 SSDs causing them to fail under high disk IO with reset storms. Maybe you're hitting that firmware bug. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] best migration path from Solaris 10
On Fri, Mar 18, 2011 at 06:26:37PM -0700, Michael DeMan wrote: ZFSv28 is in HEAD now and will be out in 8.3. ZFS + HAST in 9.x means being able to cluster off different hardware. In regards to OpenSolaris and Indiana - can somebody clarify the relationship there? It was clear with OpenSolaris that the latest/greatest ZFS would always be available since it was a guinea-pig product for cost conscious folks and served as an excellent area for Sun to get marketplace feedback and bug fixes done before rolling updates into full Solaris. To me it seems that Open Indiana is basically a green branch off of a dead tree - if I am wrong, please enlighten me. The Illumos project was started as a fork of OpenSolaris when Oracle was still publishing OpenSolaris sources. Then Oracle closed OpenSolaris development, and decided to call the upcoming (closed) versions Solaris 11 Express, with no source included. The Illumos project continued the development based on the latest published OpenSolaris sources, and a bit later the OpenIndiana *distribution* was announced to deliver a binary distro based on OpenSolaris/Illumos. So in short, Illumos is the development project, which hosts the new sources, and OpenIndiana is a binary distro based on it. -- Pasi On Mar 18, 2011, at 6:16 PM, Roy Sigurd Karlsbakk wrote: I think we all feel the same pain with Oracle's purchase of Sun. FreeBSD that has commercial support for ZFS maybe? FreeBSD currently has a very old zpool version, not suitable for running with SLOGs, since if you lose it, you may lose the pool, which isn't very amusing... Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] native ZFS on Linux
On Sat, Feb 12, 2011 at 08:54:26PM +0100, Roy Sigurd Karlsbakk wrote: I see that Pinguy OS, an uber-Ubuntu o/s, includes native ZFS support. Any pointers to more info on this? There is some work in progress from http://zfsonlinux.org/, but the POSIX layer was still lacking last I checked. Kqstor made the POSIX layer. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and TRIM
On Mon, Jan 31, 2011 at 03:41:52PM +0100, Joerg Schilling wrote: Brandon High bh...@freaks.com wrote: On Sat, Jan 29, 2011 at 8:31 AM, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: What is the status of ZFS support for TRIM? I believe it's been supported for a while now. http://www.c0t0d0s0.org/archives/6792-SATA-TRIM-support-in-Opensolaris.html The command is implemented in the sata driver but there does not seem to be any user of the code. Btw, is the SCSI equivalent also implemented? IIRC it was called SCSI UNMAP (for SAS). -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] reliable, enterprise worthy JBODs?
On Tue, Jan 25, 2011 at 11:53:49AM -0800, Rocky Shek wrote: Philip, You can consider the DataON DNS-1600 4U 24-bay 6Gb/s SAS JBOD storage. http://dataonstorage.com/dataon-products/dns-1600-4u-6g-sas-to-sas-sata-jbod-storage.html It is the best fit for ZFS storage applications. It can be a good replacement for the Sun/Oracle J4400 and J4200. There is also the ultra-density DNS-1660 4U 60-bay 6Gb/s SAS JBOD storage and other form factor JBODs. http://dataonstorage.com/dataon-products/6g-sas-jbod/dns-1660-4u-60-bay-6g-35inch-sassata-jbod.html Does (Open)Solaris FMA work with these DataON JBODs? .. meaning do the failure LEDs work automatically in the case of disk failure? I guess that requires the SES chip on the JBOD to include proper drive identification for all slots. -- Pasi Rocky -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Philip Brown Sent: Tuesday, January 25, 2011 10:05 AM To: zfs-discuss@opensolaris.org Subject: [zfs-discuss] reliable, enterprise worthy JBODs? So, another hardware question :) ZFS has been touted as taking maximal advantage of disk hardware, to the point where it can be used efficiently and cost-effectively on JBODs, rather than having to throw more expensive RAID arrays at it. Only trouble is.. JBODs seem to have disappeared :( Sun/Oracle has discontinued its J4000 line, with no replacement that I can see. IBM seems to have some nice looking hardware in the form of its EXP3500 expansion trays... but they only support it connected to an IBM (SAS) controller... which is only supported when plugged into IBM server hardware :( Any other suggestions for (large-)enterprise-grade, supported JBOD hardware for ZFS these days? Either fibre or SAS would be okay. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On Sat, Jan 08, 2011 at 12:33:50PM -0500, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Garrett D'Amore When you purchase NexentaStor from a top-tier Nexenta Hardware Partner, you get a product that has been through a rigorous qualification process How do I do this, exactly? I am serious. Before too long, I'm going to need another server, and I would very seriously consider reprovisioning my unstable Dell Solaris server to become a linux or some other stable machine. The role it's currently fulfilling is the backup server, which basically does nothing except zfs receive from the primary Sun solaris 10u9 file server. Since the role is just for backups, it's a perfect opportunity for experimentation, hence the Dell hardware with solaris. I'd be happy to put some other configuration in there experimentally instead ... say ... nexenta. Assuming it will be just as good at zfs receive from the primary server. Is there some specific hardware configuration you guys sell? Or recommend? How about a Dell R510/R610/R710? Buy the hardware separately and buy NexentaStor as just a software product? Or buy a somehow more certified hardware+software bundle together? If I do encounter a bug, where the only known fact is that the system keeps crashing intermittently on an approximately weekly basis, and there is absolutely no clue what's wrong in hardware or software... How do you guys handle it? If you'd like to follow up offlist, that's fine. Then just email me at the email address: nexenta at nedharvey.com (I use disposable email addresses on mailing lists like this, so at any random unknown time, I'll destroy my present alias and start using a new one.) Hey, Other OS's have had problems with the Broadcom NICs as well.. See for example this RHEL5 bug: https://bugzilla.redhat.com/show_bug.cgi?id=520888 Host crashing probably due to MSI-X IRQs with the bnx2 NIC.. And VMware vSphere ESX/ESXi 4.1 crashing with bnx2x: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1029368 So I guess there are firmware/driver problems affecting not just Solaris but also other operating systems.. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Looking for 3.5 SSD for ZIL
On Wed, Dec 22, 2010 at 11:36:48AM +0100, Stephan Budach wrote: Hello all, I am shopping around for 3.5" SSDs that I can mount into my storage and use as ZIL drives. As of yet, I have only found 3.5" models with the Sandforce 1200, which was not recommended on this list. I think the recommendation was not to use SSDs at all for ZIL, not just specifically Sandforce controllers? -- Pasi Does anyone maybe know of a model that has the Sandforce 1500 and is 3.5"? Or any other 3.5" SSD that he/she can recommend? Cheers, budy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Looking for 3.5 SSD for ZIL
On Wed, Dec 22, 2010 at 01:43:35PM, Jabbar wrote: Hello, I was thinking of buying a couple of SSDs until I found out that TRIM is only supported with SATA drives. Yes, because TRIM is an ATA command. SATA means Serial ATA. SCSI (SAS) drives have the WRITE SAME command, which is the equivalent command there. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
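On the Linux side, whether a SATA SSD actually advertises TRIM can be checked from its IDENTIFY DEVICE data with hdparm. A sketch only; /dev/sdX is a placeholder device:

# hdparm -I /dev/sdX | grep -i trim
           *    Data Set Management TRIM supported

If that line is missing, the drive does not report the Data Set Management / TRIM feature at all, regardless of what the OS could do with it.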
Re: [zfs-discuss] AHCI or IDE?
On Thu, Dec 16, 2010 at 08:43:02PM +0100, Alexander Lesle wrote: Hello All, I want to build a home file and media server now. After experimenting with an Asus board and running into unsolved problems I have bought this Supermicro board X8SIA-F with an Intel i3-560 and 8 GB RAM http://www.supermicro.com/products/motherboard/Xeon3000/3400/X8SIA.cfm?IPMI=Y and also the LSI HBA SAS 9211-8i http://lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/internal/sas9211-8i/index.html rpool = 2vdev mirror, tank = 2 x 2vdev mirror. For the future I want to have the option to expand up to 12 x 2vdev mirrors. After reading the board manual I found on page 4-9 where I can set SATA#1 from IDE to AHCI. Can zfs handle AHCI for rpool? Can zfs handle AHCI for tank? Thx for helping. You definitely want to use AHCI and not the legacy IDE mode. AHCI enables:
- disk hotswap
- NCQ (Native Command Queuing) to execute multiple commands at the same time
-- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
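Once the BIOS is switched over, it is easy to confirm from the running system that the controller really attached with the AHCI driver rather than the legacy IDE (ata) driver. A sketch for Solaris-derived systems:

# prtconf -D | grep -i ahci

If nothing shows up there, the SATA ports are still being driven in IDE compatibility mode.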
Re: [zfs-discuss] zpool does not like iSCSI ?
On Tue, Nov 09, 2010 at 04:18:17AM -0800, Andreas Koppenhoefer wrote: From Oracle Support we got the following info: Bug ID: 6992124 reboot of Sol10 u9 host makes zpool FAULTED when zpool uses iscsi LUNs This is a duplicate of: Bug ID: 6907687 zfs pool is not automatically fixed when disk are brought back online or after boot An IDR patch already exists, but no official patch yet. Do you know if these bugs are fixed in Solaris 11 Express ? -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS
On Wed, Nov 17, 2010 at 10:14:10AM, Bruno Sousa wrote: Hi all, Let me tell you all that MC/S *does* make a difference... I had a Windows fileserver using an iSCSI connection to a host running snv_134 with an average speed of 20-35 MB/s... After the upgrade to snv_151a (Solaris 11 Express) this same fileserver got a performance boost and now has an average speed of 55-60 MB/s. Not double the performance, but WAY better, especially if we consider that this performance boost was purely software based :) Did you verify you're using more connections after the update? Or was it just *other* COMSTAR (and/or kernel) updates making the difference.. -- Pasi Nice... nice job COMSTAR guys! Bruno On Tue, 16 Nov 2010 19:49:59 -0500, Jim Dunham james.dun...@oracle.com wrote: On Nov 16, 2010, at 6:37 PM, Ross Walker wrote: On Nov 16, 2010, at 4:04 PM, Tim Cook t...@cook.ms wrote: AFAIK, esx/i doesn't support L4 hash, so that's a non-starter. For iSCSI one just needs to have a second (third or fourth...) iSCSI session on a different IP to the target and run mpio/mpxio/mpath or whatever your OS calls multi-pathing. MC/S (Multiple Connections per Session) support was added to the iSCSI Target in COMSTAR, now available in Oracle Solaris 11 Express. - Jim -Ross -- Bruno Sousa ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Finding corrupted files
On Sat, Oct 16, 2010 at 08:38:28AM -0700, Richard Elling wrote: On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote: So, what would you suggest, if I wanted to create really big pools? Say in the 100 TB range? That would be quite a number of single drives then, especially when you want to go with zpool raid-1. For 100 TB, the methods change dramatically. You can't just reload 100 TB from CD or tape. When you get to this scale you need to be thinking about raidz2+ *and* mirroring. I will be exploring these issues of scale at the Techniques for Managing Huge Amounts of Data tutorial at the USENIX LISA '10 Conference. http://www.usenix.org/events/lisa10/training/ Hopefully your presentation will be available online after the event! -- Pasi -- richard -- OpenStorage Summit, October 25-27, Palo Alto, CA http://nexenta-summit2010.eventbrite.com USENIX LISA '10 Conference November 8-16 ZFS and performance consulting http://www.RichardElling.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedicated ZIL/L2ARC
On Tue, Sep 14, 2010 at 08:08:42AM -0700, Ray Van Dolson wrote: On Tue, Sep 14, 2010 at 06:59:07AM -0700, Wolfraider wrote: We are looking into the possibility of adding dedicated ZIL and/or L2ARC devices to our pool. We are looking into getting 4 x 32GB Intel X25-E SSD drives. Would this be a good solution to slow write speeds? We are currently sharing out different slices of the pool to windows servers using comstar and fibrechannel. We are currently getting around 300MB/sec performance with 70-100% disk busy. Opensolaris snv_134, dual 3.2GHz quadcores with hyperthreading, 16GB RAM. Pool_1 - 18 raidz2 groups with 5 drives apiece and 2 hot spares. Disks are around 30% full. No dedup. It'll probably help. I'd get two X-25Es for ZIL (and mirror them) and one or two of Intel's lower end X-25Ms for L2ARC. There are some SSD devices out there with a super-capacitor and significantly higher IOPS ratings than the X-25E that might be a better choice for a ZIL device, but the X-25E is a solid drive and we have many of them deployed as ZIL devices here. I thought Intel SSDs didn't respect the CACHE FLUSH command and thus are subject to ZIL corruption if the server crashes or loses power? -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
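For reference, adding a mirrored log device and cache devices to an existing pool is a one-liner each. A sketch only; the device names below are placeholders for whatever the new SSDs show up as, and the pool name is taken from the post above:

# zpool add Pool_1 log mirror c7t0d0 c7t1d0
# zpool add Pool_1 cache c7t2d0 c7t3d0

Log devices can (and should) be mirrored; cache (L2ARC) devices cannot be mirrored, they are simply striped.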
Re: [zfs-discuss] carrying on [was: Legality and the future of zfs...]
On Sat, Jul 17, 2010 at 12:57:40AM +0200, Richard Elling wrote: Because of BTRFS for Linux, Linux's popularity itself and also thanks to the Oracle's help. BTRFS does not matter until it is a primary file system for a dominant distribution. From what I can tell, the dominant Linux distribution file system is ext. That will change some day, but we heard the same story you are replaying about BTRFS from the Reiser file system aficionados and the XFS evangelists. There is absolutely no doubt that Solaris will use ZFS as its primary file system. But there is no internal or external force causing Red Hat to change their primary file system from ext. Redhat Fedora 13 includes BTRFS, but it's not used as a default (yet). F13 also supports yum (package management) rollback using BTRFS snapshots. I'm not sure if Fedora 14 will have BTRFS as a default.. RHEL6 beta also includes BTRFS support (tech preview), but again, not enabled as a default filesystem. Upcoming Ubuntu 10.10 will use BTRFS as a default. That's the status in Linux world, afaik :) -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NexentaStor Community edition 3.0.3 released
On Tue, Jun 15, 2010 at 10:57:53PM +0530, Anil Gulecha wrote: Hi All, On behalf of the NexentaStor team, I'm happy to announce the release of NexentaStor Community Edition 3.0.3. This release is the result of the community efforts of Nexenta partners and users. Changes over 3.0.2 include:
* Many fixes to ON/ZFS backported to b134.
* Multiple bug fixes in the appliance.
With the addition of many new features, NexentaStor CE is the *most complete* and feature-rich gratis unified storage solution today. Quick Summary of Features:
* ZFS additions: Deduplication (based on OpenSolaris b134).
* Free for up to 12 TB of *used* storage
* Community edition supports easy upgrades
* Many new features in the easy to use management interface.
* Integrated search
Grab the iso from http://www.nexentastor.org/projects/site/wiki/CommunityEdition If you are a storage solution provider, we invite you to join our growing social network at http://people.nexenta.com. Hey, I tried installing Nexenta 3.0.3 on an old HP DL380 G4 server, and it installed OK, but it crashes all the time.. basically 5-30 seconds after the login prompt shows up on the console the server will reboot due to a kernel crash. The error seems to be about the Broadcom NIC driver.. Is this a known bug? See the screenshots for the kernel error message: http://pasik.reaktio.net/nexenta/nexenta303-crash02.jpg http://pasik.reaktio.net/nexenta/nexenta303-crash01.jpg -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Erratic behavior on 24T zpool
On Fri, Jun 18, 2010 at 01:26:11AM -0700, artiepen wrote: Well, I've searched my brains out and I can't seem to find a reason for this. I'm getting bad to medium performance with my new test storage device. I've got 24 1.5T disks with 2 SSDs configured as a ZIL log device. I'm using the Areca raid controller, the driver being arcmsr. Quad core AMD with 16 GB of RAM. OpenSolaris upgraded to snv_134. The zpool has 2 11-disk raidz2's and I'm getting anywhere between 1MB/sec and 40MB/sec with zpool iostat. On average, though, it's more like 5MB/sec if I watch while I'm actively doing some r/w. I know that I should be getting better performance. How are you measuring the performance? Do you understand that raidz2 with that many disks in it will give you really poor random write performance? -- Pasi I'm new to OpenSolaris, but I've been using *nix systems for a long time, so if there's any more information that I can provide, please let me know. Am I doing anything wrong with this configuration? Thanks in advance. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OCZ Devena line of enterprise SSD
On Thu, Jun 17, 2010 at 09:58:25AM -0700, Ray Van Dolson wrote: On Thu, Jun 17, 2010 at 09:54:59AM -0700, Ragnar Sundblad wrote: On 17 jun 2010, at 18.17, Richard Jahnel wrote: The EX specs page does list the supercap. The Pro specs page does not. They do for both on the Specifications tab on the web page: http://www.ocztechnology.com/products/solid-state-drives/2-5--sata-ii/maximum-performance-enterprise-solid-state-drives/ocz-vertex-2-pro-series-sata-ii-2-5--ssd-.html But not in the product brief PDFs. It doesn't say how many rewrites you can do either. An Intel X25-E 32G has, according to the product manual, a write endurance of 1 petabyte. At full write speed, 250 MB/s, that is equal to about 4,000,000 seconds, or about 46 days. (On the other hand you have a five year warranty, and I have been told that you can get them replaced if they wear out.) Do the drives keep any sort of internal counter so you get an idea of how much of the rated drive lifetime you've chewed through? Heh.. the marketing stuff on the 'front' page says: Vertex 2 EX has an ultra-reliable 10 million hour MTBF and comes backed by a three-year warranty. And then the specifications say: MTBF: 2 million hours :) -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Erratic behavior on 24T zpool
On Fri, Jun 18, 2010 at 04:52:02AM -0400, Curtis E. Combs Jr. wrote: I am new to zfs, so I am still learning. I'm using zpool iostat to measure performance. Would you say that smaller raidz2 sets would give me more reliable and better performance? I'm willing to give it a shot... Yes, more, smaller raid sets will give you better performance, since zfs distributes (stripes) data across all of them. What's your IO pattern? Random writes? Sequential writes? Basically, if you have 2x 11-disk raidz2 sets you'll be limited to around the performance of 2 disks, in the worst case of small random IO (the parity needs to be written, and that limits the performance of raidz/z2/z3 to the performance of a single disk). This is not really zfs specific at all, it's the same with any raid implementation. -- Pasi On Fri, Jun 18, 2010 at 4:42 AM, Pasi Kärkkäinen pa...@iki.fi wrote: On Fri, Jun 18, 2010 at 01:26:11AM -0700, artiepen wrote: Well, I've searched my brains out and I can't seem to find a reason for this. I'm getting bad to medium performance with my new test storage device. I've got 24 1.5T disks with 2 SSDs configured as a ZIL log device. I'm using the Areca raid controller, the driver being arcmsr. Quad core AMD with 16 GB of RAM. OpenSolaris upgraded to snv_134. The zpool has 2 11-disk raidz2's and I'm getting anywhere between 1MB/sec and 40MB/sec with zpool iostat. On average, though, it's more like 5MB/sec if I watch while I'm actively doing some r/w. I know that I should be getting better performance. How are you measuring the performance? Do you understand that raidz2 with that many disks in it will give you really poor random write performance? -- Pasi I'm new to OpenSolaris, but I've been using *nix systems for a long time, so if there's any more information that I can provide, please let me know. Am I doing anything wrong with this configuration? Thanks in advance. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Curtis E. Combs Jr. System Administrator Associate University of Georgia High Performance Computing Center ceco...@uga.edu Office: (706) 542-0186 Cell: (706) 206-7289 Gmail Chat: psynoph...@gmail.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
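To illustrate the point about more, smaller vdevs: with the same 24 disks, a layout along the lines of the sketch below gives four raidz2 vdevs to stripe across instead of two (the device names are placeholders, not the poster's actual disks):

# zpool create tank \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
    raidz2 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
    raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
    raidz2 c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0

Each additional top-level vdev adds roughly one disk's worth of random-write capability, at the cost of more disks spent on parity.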
Re: [zfs-discuss] Erratic behavior on 24T zpool
On Fri, Jun 18, 2010 at 05:15:44AM -0400, Thomas Burgess wrote: On Fri, Jun 18, 2010 at 4:42 AM, Pasi Kärkkäinen pa...@iki.fi wrote: On Fri, Jun 18, 2010 at 01:26:11AM -0700, artiepen wrote: Well, I've searched my brains out and I can't seem to find a reason for this. I'm getting bad to medium performance with my new test storage device. I've got 24 1.5T disks with 2 SSDs configured as a ZIL log device. I'm using the Areca raid controller, the driver being arcmsr. Quad core AMD with 16 GB of RAM. OpenSolaris upgraded to snv_134. The zpool has 2 11-disk raidz2's and I'm getting anywhere between 1MB/sec and 40MB/sec with zpool iostat. On average, though, it's more like 5MB/sec if I watch while I'm actively doing some r/w. I know that I should be getting better performance. How are you measuring the performance? Do you understand that raidz2 with that many disks in it will give you really poor random write performance? -- Pasi I have a media server with 2 raidz2 vdevs 10 drives wide myself, without a ZIL (but with a 64 GB L2ARC). I can write to it at about 400 MB/s over the network, and scrubs show 600 MB/s, but it really depends on the type of i/o you have... random i/o across 2 vdevs will be REALLY slow (as slow as the slowest 2 drives in your pool, basically). 40 MB/s might be right if it's random, though I'd still expect to see more. A 7200 RPM SATA disk can do around 120 IOPS max (7200/60 = 120), so if you're doing 4 kB random IO you end up getting 4*120 = 480 kB/sec throughput max from a single disk (in the worst case). 40 MB/sec of random IO throughput using 4 kB IOs would be around 10240 IOPS.. you'd need 85x SATA 7200 RPM disks in raid-0 (striping) for that :) -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Erratic behavior on 24T zpool
On Fri, Jun 18, 2010 at 02:21:15AM -0700, artiepen wrote: 40MB/sec is the best that it gets. Really, the average is 5. I see 4, 5, 2, and 6 almost 10x as many times as I see 40MB/sec. It really only bumps up to 40 very rarely. As far as random vs. sequential goes: correct me if I'm wrong, but if I used dd to make files from /dev/zero, wouldn't that be sequential? I measure with zpool iostat 2 in another ssh session while making files of various sizes. Yep, dd will generate sequential IO. Did you specify a blocksize for dd (bs=1024k for example)? By default dd does tiny (512-byte) IOs.. which won't be very fast. -- Pasi This is a test system. I'm wondering, now, if I should just reconfigure with maybe 7 disks and add another spare. Seems to be the general consensus that bigger raid pools = worse performance. I thought the opposite was true... -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
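For a quick sequential-write test with an explicit, large block size, something along these lines works (a sketch; the target path and sizes are only examples, and the file must land on the pool being tested):

# dd if=/dev/zero of=/mypool/testfile bs=1024k count=10240

while watching zpool iostat 2 in another session, as above. With bs=1024k each write is 1 MB instead of dd's tiny default, so the result reflects sequential throughput rather than per-IO overhead.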
Re: [zfs-discuss] Homegrown Hybrid Storage
On Tue, Jun 08, 2010 at 08:33:40PM -0500, Bob Friesenhahn wrote: On Tue, 8 Jun 2010, Miles Nordin wrote: re == Richard Elling richard.ell...@gmail.com writes: re Please don't confuse Ethernet with IP. okay, but I'm not. seriously, if you'll look into it. Did you misread where I said FC can exert back-pressure? I was contrasting with Ethernet. You're really confused, though I'm sure you're going to deny it. I don't think so. I think that it is time to reset and reboot yourself on the technology curve. FC semantics have been ported onto ethernet. This is not your grandmother's ethernet but it is capable of supporting both FCoE and normal IP traffic. The FCoE gets per-stream QOS similar to what you are used to from Fibre Channel. Quite naturally, you get to pay a lot more for the new equipment and you have the opportunity to discard the equipment you bought already. Yeah, today enterprise iSCSI vendors like Equallogic (bought by Dell) _recommend_ using flow control. Their iSCSI storage arrays are designed to work properly with flow control and perform well. Of course you need proper (certified) switches as well. Equallogic says the delays from flow control pause frames are shorter than TCP retransmits, so that's why they're using and recommending it. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Homegrown Hybrid Storage
On Fri, Jun 11, 2010 at 03:30:26PM -0400, Miles Nordin wrote: pk == Pasi Kärkkäinen pa...@iki.fi writes: You're really confused, though I'm sure you're going to deny it. I don't think so. I think that it is time to reset and reboot yourself on the technology curve. FC semantics have been ported onto ethernet. This is not your grandmother's ethernet but it is capable of supporting both FCoE and normal IP traffic. The FCoE gets per-stream QOS similar to what you are used to from Fibre Channel. FCoE != iSCSI. FCoE was not being discussed in the part you're trying to contradict. If you read my entire post, I talk about FCoE at the end and say more or less ``I am talking about FCoE here only so you don't try to throw out my entire post by latching onto some corner case not applying to the OP by dragging FCoE into the mix'' which is exactly what you did. I'm guessing you fired off a reply without reading the whole thing? pk Yeah, today enterprise iSCSI vendors like Equallogic (bought pk by Dell) _recommend_ using flow control. Their iSCSI storage pk arrays are designed to work properly with flow control and pk perform well. pk Of course you need proper (certified) switches as well. pk Equallogic says the delays from flow control pause frames are pk shorter than TCP retransmits, so that's why they're using and pk recommending it. please have a look at the three links I posted about flow control not being used the way you think it is by any serious switch vendor, and the explanation of why this limitation is fundamental, not something that can be overcome by ``technology curve.'' It will not hurt anything to allow autonegotiation of flow control on non-broken switches so I'm not surprised they recommend it with ``certified'' known-non-broken switches, but it also will not help unless your switches have input/backplane congestion which they usually don't, or your end host is able to generate PAUSE frames for PCIe congestion which is maybe more plausible. In particular it won't help with the typical case of the ``incast'' problem in the experiment in the FAST incast paper URL I gave, because they narrowed down what was happening in their experiment to OUTPUT queue congestion, which (***MODULO FCoE*** mr ``reboot yourself on the technology curve'') never invokes ethernet flow control. HTH. ok let me try again: yes, I agree it would not be stupid to run iSCSI+TCP over a CoS with blocking storage-friendly buffer semantics if your FCoE/CEE switches can manage that, but I would like to hear of someone actually DOING it before we drag it into the discussion. I don't think that's happening in the wild so far, and it's definitely not the application for which these products have been flogged. I know people run iSCSI over IB (possibly with RDMA for moving the bulk data rather than TCP), and I know people run SCSI over FC, and of course SCSI (not iSCSI) over FCoE. Remember the original assertion was: please try FC as well as iSCSI if you can afford it. Are you guys really saying you believe people are running ***iSCSI*** over the separate HOL-blocking hop-by-hop pause frame CoS's of FCoE meshes? or are you just spewing a bunch of noxious white paper vapours at me? because AIUI people using the lossless/small-output-buffer channel of FCoE are running the FC protocol over that ``virtual channel'' of the mesh, not iSCSI, are they not? I was talking about iSCSI over TCP over IP over Ethernet. No FCoE. No IB.
-- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Intel X25-E SSD in x4500 followup
On Thu, Jun 10, 2010 at 05:46:19AM -0700, Peter Eriksson wrote: Just a quick followup that the same issue still seems to be there on our X4500s with the latest Solaris 10 with all the latest patches and the following SSD disks: Intel X25-M G1 firmware 8820 (80GB MLC) Intel X25-M G2 firmware 02HD (160GB MLC) What problems did you have with the X25-M models? -- Pasi However - things seem to work smoothly with: Intel X25-E G1 firmware 8850 (32GB SLC) OCZ Vertex 2 firmware 1.00 and 1.02 (100GB MLC) I'm currently testing a setup with dual OCZ Vertex 2 100GB SSD units that will be used both as mirrored boot/root (32GB of the 100GB), and then use the rest of those disks as L2ARC cache devices for the big data zpool. And have two mirrored X25-E as slog devices: zpool create DATA raidz2 c0t0d0 c0t1d0 c1t0d0 c1t1d0 c2t0d0 c2t1d0 c3t1d0 \ raidz2 c4t0d0 c4t1d0 c5t0d0 c5t1d0 c0t2d0 c0t3d0 c3t2d0 \ raidz2 c1t2d0 c1t3d0 c2t2d0 c2t3d0 c4t2d0 c4t3d0 c3t3d0 \ raidz2 c5t2d0 c5t3d0 c0t4d0 c0t5d0 c1t4d0 c1t5d0 c3t5d0 \ raidz2 c2t4d0 c2t5d0 c4t4d0 c4t5d0 c5t4d0 c5t5d0 c3t6d0 \ raidz2 c0t6d0 c0t7d0 c1t6d0 c1t7d0 c2t6d0 c2t7d0 c3t7d0 \ spare c4t6d0 c5t6d0 \ cache c3t0d0s3 c3t4d0s3 \ log mirror c4t7d0 c5t7d0 -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
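For comparison, the cache and log devices in a layout like that can also be attached after the pool exists; a rough sketch using the same example device names as in the zpool create above:
zpool add DATA cache c3t0d0s3 c3t4d0s3
zpool add DATA log mirror c4t7d0 c5t7d0
zpool iostat -v DATA 5
The iostat output then lists per-device activity, which makes it easy to confirm that the slog mirror and the L2ARC partitions are actually being used under load.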
Re: [zfs-discuss] nfs share of nested zfs directories?
On Fri, Jun 04, 2010 at 08:43:32AM -0400, Cassandra Pugh wrote: Thank you. When I manually mount using the mount -t nfs4 option, I am able to see the entire tree; however, the permissions are set as nfsnobody. Warning: rpc.idmapd appears not to be running. All uids will be mapped to the nobody uid. Did you actually read the error message? :) Finding a solution shouldn't be too difficult after that.. -- Pasi - Cassandra (609) 243-2413 Unix Administrator From a little spark may burst a mighty flame. -Dante Alighieri On Thu, Jun 3, 2010 at 4:33 PM, Brandon High bh...@freaks.com wrote: On Thu, Jun 3, 2010 at 12:50 PM, Cassandra Pugh cp...@pppl.gov wrote: The special case here is that I am trying to traverse NESTED zfs systems, for the purpose of having compressed and uncompressed directories. Make sure to use mount -t nfs4 on your Linux client. The standard nfs type only supports NFS v2/v3. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
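A rough sketch of the usual fix, assuming an OpenSolaris NFS server and a Linux client (dataset names, mount points, and the service name are examples only; rpcidmapd is the RHEL-style service name and differs per distribution):
# on the server: share the parent and the nested datasets
zfs set sharenfs=on tank/export
zfs set sharenfs=on tank/export/compressed
# on the Linux client: give idmapd an NFSv4 domain that matches the server's (on Solaris, nfsmapid's domain), then start it and mount with NFSv4
grep -i domain /etc/idmapd.conf        # e.g. Domain = example.com on both ends
service rpcidmapd start
mount -t nfs4 server:/tank/export /mnt/export
Once idmapd is running on both ends with the same domain, files should show their real owners instead of nfsnobody.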
Re: [zfs-discuss] [ZIL device brainstorm] intel x25-M G2 has ram cache?
On Tue, May 25, 2010 at 10:08:57AM +0100, Karl Pielorz wrote: --On 24 May 2010 23:41 -0400 rwali...@washdcmail.com wrote: I haven't seen where anyone has tested this, but the MemoRight SSD (sold by RocketDisk in the US) seems to claim all the right things: http://www.rocketdisk.com/vProduct.aspx?ID=1 pdf specs: http://www.rocketdisk.com/Local/Files/Product-PdfDataSheet-1_MemoRight%20SSD%20GT%20Specification.pdf They claim to support the cache flush command, and with respect to DRAM cache backup they say (p. 14/section 3.9 in that pdf): At the risk of this getting a little off-topic (but hey, we're all looking for ZFS ZIL's ;) We've had similar issues when looking at SSD's recently (lack of cache protection during power failure) - the above SSD's look interesting [finally someone's noted you need to protect the cache] - but from what I've read about the Intel X25-E performance - the Intel drive with write cache turned off appears to be as fast, if not faster than those drives anyway... I've tried contacting Intel to find out if it's true their enterprise SSD has no cache protection on it, and what the effect of turning the write cache off would have on both performance and write endurance, but not heard anything back yet. I guess the problem is not the cache by itself, but the fact that they ignore the CACHE FLUSH command.. and thus the non-battery-backed cache becomes a problem. -- Pasi Picking apart the Intel benchmarks published - they always have the write-cache enabled, which probably speaks volumes... -Karl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [ZIL device brainstorm] intel x25-M G2 has ram cache?
On Tue, May 25, 2010 at 01:52:47PM +0100, Karl Pielorz wrote: --On 25 May 2010 15:28 +0300 Pasi Kärkkäinen pa...@iki.fi wrote: I've tried contacting Intel to find out if it's true their enterprise SSD has no cache protection on it, and what the effect of turning the write cache off would have on both performance and write endurance, but not heard anything back yet. I guess the problem is not the cache by itself, but the fact that they ignore the CACHE FLUSH command.. and thus the non-battery-backed cache becomes a problem. The X25-E's do apparently honour the 'Disable Write Cache' command - without write cache, there is no cache to flush - all data is written to flash immediately - presumably before it's ACK'd to the host. I've seen a number of other sites do some testing with this - and found that it 'works' (i.e. with write-cache enabled, you get nasty data loss if the power is lost - with it disabled, it closes that window). But you obviously take quite a sizeable performance hit. Yeah.. what I meant is: if you have write cache enabled, and the SSD drive honours the 'CACHE FLUSH' command, then you should be safe.. Based on what I've understood, the Intel SSDs ignore the CACHE FLUSH command, and thus it's not safe to run them with caches enabled.. We've got an X25-E here which we intend to test for ourselves (wisely ;) - to make sure that is the case... Please let us know how it goes :) -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
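For anyone wanting to run that kind of test, the on-disk write cache can be toggled from Solaris using format's expert mode; a minimal sketch (the exact menu entries can vary by driver, so treat this as an outline rather than a guaranteed procedure):
# format -e
  (select the SSD from the disk list)
format> cache
cache> write_cache
write_cache> display        # show the current state
write_cache> disable        # turn the volatile write cache off
Re-running a synchronous-write benchmark with the cache disabled versus enabled makes it fairly obvious whether the drive was relying on its DRAM cache for its headline numbers.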
Re: [zfs-discuss] Using WD Green drives?
On Mon, May 17, 2010 at 03:12:44PM -0700, Erik Trimble wrote: On Mon, 2010-05-17 at 12:54 -0400, Dan Pritts wrote: On Mon, May 17, 2010 at 06:25:18PM +0200, Tomas Ögren wrote: Resilver does a whole lot of random I/O itself, not bulk reads.. It reads the filesystem tree, not block 0, block 1, block 2... You won't get 60MB/s sustained, not even close. Even with large, unfragmented files? danno -- Dan Pritts, Sr. Systems Engineer Internet2 office: +1-734-352-4953 | mobile: +1-734-834-7224 Having large, unfragmented files will certainly help keep sustained throughput. But, also, you have to consider the amount of deletions done on the pool. For instance, let's say you wrote files A, B, and C one right after another, and they're all big files. Doing a re-silver, you'd be pretty well off on getting reasonable throughput reading A, then B, then C, since they're going to be contiguous on the drive (both internally, and across the three files). However, if you have deleted B at some time, and say wrote a file D (where D < B in size) into B's old space, then, well, you seek to A, read A, seek forward to C, read C, seek back to D, etc. Thus, you'll get good throughput for resilver on these drives pretty much in just ONE case: large files with NO deletions. If you're using them for write-once/read-many/no-delete archives, then you're OK. Anything else is going to suck. :-) So basically, if you have a lot of small files with a lot of changes and deletions, resilver is going to be really slow. Sounds like traditional RAID would be better/faster to rebuild in this case.. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
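Either way, the effect is easy to observe while a resilver is running; a small sketch (the pool name is just an example):
zpool status tank             # shows 'resilver in progress' with percent done and elapsed time
zpool iostat -v tank 10       # per-device read/write bandwidth, sampled every 10 seconds
On a heavily churned pool, the per-disk read bandwidth during a resilver sits far below the drive's sequential rate, which is exactly the random-I/O effect described above.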
Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS
On Sat, May 15, 2010 at 11:01:00AM +, Marc Bevand wrote: I have done quite some research over the past few years on the best (i.e. simple, robust, inexpensive, and performant) SATA/SAS controllers for ZFS. Especially in terms of throughput analysis (many of them are designed with an insufficient PCIe link width). I have seen many questions on this list about which one to buy, so I thought I would share my knowledge: http://blog.zorinaq.com/?e=10 Very briefly: - The best 16-port one is probably the LSI SAS2116, 6Gbps, PCIe (gen2) x8. Because it is quite pricey, it's probably better to buy 2 8-port controllers. - The best 8-port is the LSI SAS2008 (faster, more expensive) or SAS1068E (150MB/s/port should be sufficient). - The best 2-port is the Marvell 88SE9128 or 88SE9125 or 88SE9120 because of PCIe gen2 allowing a throughput of at least 300MB/s on the PCIe link with Max_Payload_Size=128. And this one is particularly cheap ($35). AFAIK this is the _only_ controller on the entire market allowing 2 drives to not bottleneck an x1 link. I hope this helps ZFS users here! Excellent post! It'll definitely help many. Thanks! -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
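A back-of-the-envelope check of that 300MB/s figure (the overhead numbers are approximations, not exact protocol accounting):
PCIe gen2, one lane: 5 GT/s with 8b/10b encoding = 500 MB/s of raw bandwidth per direction.
With Max_Payload_Size=128, each 128-byte payload carries roughly 20-24 bytes of TLP header and framing, so usable bandwidth is about 128 / 152 * 500 MB/s = ~420 MB/s, before ACK/flow-control DLLP traffic takes its share.
Real-world x1 gen2 links therefore land around 300-400 MB/s, which is indeed enough for two ~150MB/s SATA drives, whereas an x1 gen1 link (~250 MB/s raw) would bottleneck them.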
Re: [zfs-discuss] Performance of the ZIL
On Wed, May 05, 2010 at 11:32:23PM -0400, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Robert Milkowski if you can disable ZIL and compare the performance to when it is off it will give you an estimate of what's the absolute maximum performance increase (if any) by having a dedicated ZIL device. I'll second this suggestion. It'll cost you nothing to disable the ZIL temporarily. (You have to remount the filesystem twice: once to disable the ZIL, and once to re-enable it.) Then you can see if performance is good. If performance is good, then you'll know you need to accelerate your ZIL. (Because disabled ZIL is the fastest thing you could possibly ever do.) Generally speaking, you should not disable your ZIL for the long run. But in some cases, it makes sense. Here's how you determine if you want to disable your ZIL permanently: First, understand that with the ZIL disabled, all sync writes are treated as async writes. This is buffered in RAM before being written to disk, so the kernel can optimize and aggregate the write operations into one big chunk. No matter what, if you have an ungraceful system shutdown, you will lose all the async writes that were waiting in RAM. If you have ZIL disabled, you will also lose the sync writes that were waiting in RAM (because those are being handled as async.) In neither case do you have data or filesystem corruption. ZFS probably is still OK, since it's designed to handle this (?), but the data can't be OK if you lose 30 secs of writes.. 30 secs of writes that have already been acknowledged to the servers/applications.. The risk of running with no ZIL is: In the case of ungraceful shutdown, in addition to the (up to 30 sec) async writes that will be lost, you will also lose up to 30 sec of sync writes. -- Pasi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
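For the record, a minimal sketch of how the temporary test is usually done (pool/dataset names are examples; which method applies depends on the build you are running):
# older builds: the zil_disable kernel tunable, which takes effect on the next mount
echo zil_disable/W0t1 | mdb -kw
zfs umount tank/fs ; zfs mount tank/fs
  (run the benchmark)
echo zil_disable/W0t0 | mdb -kw
zfs umount tank/fs ; zfs mount tank/fs
# builds that have the per-dataset sync property can do it without touching the kernel:
zfs set sync=disabled tank/fs
zfs set sync=standard tank/fs
This is only for measurement; as noted above, leaving the ZIL disabled trades away the durability of acknowledged synchronous writes.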