[zfs-discuss] ZFS Import failed: Request rejected: too large for CDB
Hello, I am new to this list, but I have a big problem: we have a Sun Fire V440 with a SCSI RAID system connected. I can see all the devices and partitions. After a failure in the UPS system the zpool is not accessible anymore. The zpool is a plain stripe over 4 partitions. First we ran "zpool export Produktion" to keep the pool in order, but now we cannot import the pool anymore. The command was "zpool import -f Produktion" and we get the following errors:

Nov 25 10:41:06 MDSP scsi: WARNING: /p...@1f,70/s...@2,1/s...@0,0 (sd25):
Nov 25 10:41:06 MDSP  Request rejected: too large for CDB: lba:0x30022cbd2 len:0x0010
Nov 25 10:41:06 MDSP scsi: WARNING: /p...@1f,70/s...@2,1/s...@0,0 (sd25):
Nov 25 10:41:06 MDSP  Request rejected: too large for CDB: lba:0x30022cdd2 len:0x0010
(the same two warnings repeat several more times)

  pool: Produktion
    id: 64650935418607444
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data. The pool may be active on another system, but can be imported using the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        Produktion  FAULTED  corrupted data
          c7t0d1    ONLINE
          c7t0d2    ONLINE
          c7t0d3    ONLINE
          c7t0d4    ONLINE

Is there any chance to import the pool again? Thanks in advance for your help.

Best regards,
Ortwin Herbst
Rädler GmbH, EDV-Systeme für das Grafische Gewerbe, Nürnberg

"For every problem there is a solution: the simple one, the quick one, and the wrong one."
Re: [zfs-discuss] ZFS Import failed: Request rejected: too large for CDB
Your pool is on a device that requires a 16-byte CDB to address the entire LUN; that is, the LUN is more than 2 TB in size. However, the host bus adapter driver that is being used does not support 16-byte CDBs. Quite how you got into this situation (i.e., how you could create the volume at all) I don't know, unless you have grown the LUN since the pool was created, or the host bus adapter driver has somehow been downgraded since the pool was created. --chris
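A quick way to confirm this diagnosis (a sketch; c7t0d1 stands in for whichever LUN sd25 maps to): the reported lba:0x30022cbd2 is roughly 12.9 billion 512-byte sectors, i.e. about 6 TB, well past the 2 TB limit of the 32-bit LBA field in 10-byte CDBs.

prtvtoc /dev/rdsk/c7t0d1s2     # total sector count; compare against 0xFFFFFFFF (the 2 TB boundary)
format -e                      # select the LUN, then "verify" to dump the label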
Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128
Daniel Carosone <d...@geek.com.au> writes:

you can fetch the cr_txg (cr for creation) for a snapshot using zdb, yes, but this is hardly an appropriate interface.

agreed. zdb is also likely to cause disk activity because it looks at many things other than the specific item in question. I'd expect meta-information like this to fit comfortably in RAM over extended amounts of time; haven't tried, though.

but the very creation of a snapshot requires a new txg to note that fact in the pool.

yes, which is exactly what we're trying to avoid, because it requires disk activity to write. you missed my point: you can't compare the current txg to an old cr_txg directly, since the current txg value will be at least 1 higher, even if no changes have been made.

if the snapshot is taken recursively, all snapshots will have the same cr_txg, but that requires the same configuration for all filesets.

again, yes, but that's irrelevant - the important knowledge at this moment is that the txg has not changed since last time, and thus there will be no benefit in taking further snapshots, regardless of configuration.

yes, that's what we're trying to establish, and it's easier when all snapshots are committed in the same txg.

-- Kjetil T. Homme, Redpill Linpro AS - Changing the game
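For reference, a rough way to peek at the pool's current txg from userland (a sketch; zdb output is not a stable interface, and zdb itself may touch the disks): the uberblock dump includes the last synced txg, which is the value the discussion above wants to compare against a snapshot's cr_txg.

zdb -u tank        # prints the active uberblock, including a "txg = N" line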
Re: [zfs-discuss] ZFS Random Read Performance
Richard, first, thank you for the detailed reply ... (comments inline below)

On Tue, Nov 24, 2009 at 6:31 PM, Richard Elling <richard.ell...@gmail.com> wrote:

> more below... On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote:
>> On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling wrote:
>>> Try disabling prefetch.
>> Just tried it... no change in random read (still 17-18 MB/sec for a single thread), but sequential read performance dropped from about 200 MB/sec to 100 MB/sec (as expected). Test case is a 3 GB file accessed in 256 KB records. ARC is set to a max of 1 GB for testing. arcstat.pl shows that the vast majority (95%) of reads are missing the cache. hmmm... more testing needed.
> The question is whether the low I/O rate is because of zfs itself, or the application? Disabling prefetch will expose the application, because zfs is not creating additional and perhaps unnecessary read I/O.

The values reported by iozone are in pretty close agreement with what we are seeing with iostat during the test runs. Compression is off on zfs (the iozone test data compresses very well and yields bogus results). I am looking for a good alternative to iozone for random testing. I did put together a crude script to spawn many dd processes accessing the block device itself, each with a different seek over the range of the disk, and saw results much greater than the iozone single-threaded random performance.

> Your data which shows the sequential write, random write, and sequential read driving actv to 35 is because prefetching is enabled for the read. We expect the writes to drive to 35 with a sustained write workload of any flavor.

Understood. I tried tuning the queue size to 50 and observed that the actv went to 50 (with very little difference in performance), so returned it to the default of 35.

> The random read (with cache misses) will stall the application, so it takes a lot of threads (16?) to keep 35 concurrent I/Os in the pipeline without prefetching. The ZFS prefetching algorithm is intelligent so it actually complicates the interpretation of the data.

What bothers me is that iostat is showing the 'disk' device as not being saturated during the random read test. I'll post iostat output that I captured yesterday to http://www.ilk.org/~ppk/Geek/ You can clearly see the various test phases (sequential write, rewrite, sequential read, reread, random read, then random write).

> You're peaking at 658 256KB random IOPS for the 3511, or ~66 IOPS per drive. Since ZFS will max out at 128KB per I/O, the disks see something more than 66 IOPS each. The IOPS data from iostat would be a better metric to observe than bandwidth. These drives are good for about 80 random IOPS each, so you may be close to disk saturation. The iostat data for IOPS and svc_t will confirm.

But ... if I am saturating the 3511 with one thread, then why do I get many times that performance with multiple threads?

> The T2000 data (sheet 3) shows pretty consistently around 90 256KB IOPS per drive. Like the 3511 case, this is perhaps 20% less than I would expect, perhaps due to the measurement.

I ran the T2000 test to see if 10U8 behaved better and to make sure I wasn't seeing an oddity of the 480 / 3511 case. I wanted to see if the random read behavior was similar, and it was (in relative terms).

> Also, the 3511 RAID-5 configuration will perform random reads at around 1/2 IOPS capacity if the partition offset is 34. This was the default long ago. The new default is 256.
Our 3511's have been running 421F (latest) for a long time :-) We are religious about keeping all the 3511 FW current and matched.

> The reason is that with a 34 block offset, you are almost guaranteed that a larger I/O will stride 2 disks. You won't notice this as easily with a single thread, but it will be measurable with more threads. Double check the offset with prtvtoc or format.

How do I check the offset? format -> verify from one of the partitions is below:

format> verify

Volume name        = <        >
ascii name         = <SUN-StorEdge 3511-421F-517.23GB>
bytes/sector       = 512
sectors            = 1084710911
accessible sectors = 1084710878

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm               256    517.22GB         1084694494
  1 unassigned    wm                 0           0                  0
  2 unassigned    wm                 0           0                  0
  3 unassigned    wm                 0           0                  0
  4 unassigned    wm                 0           0                  0
  5 unassigned    wm                 0           0                  0
  6 unassigned    wm                 0           0                  0
  8   reserved    wm        1084694495      8.00MB         1084710878

format>

> Writes are a completely different matter. ZFS has a tendency to turn random writes into sequential writes, so it is pretty much useless to look at random write
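For anyone wanting to reproduce the kind of crude multi-stream random-read test Paul mentions above, a minimal sketch (hypothetical device name; Solaris dd accepts iseek in units of the block size):

#!/bin/ksh
# fire off 16 concurrent readers, each starting at a different offset
dev=/dev/rdsk/c7t0d1s0
i=0
while [ $i -lt 16 ]; do
    dd if=$dev of=/dev/null bs=256k iseek=$((i * 500000)) count=2000 &
    i=$((i + 1))
done
wait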
Re: [zfs-discuss] sharemgr
On Nov 24, 2009, at 3:41 PM, dick hoogendijk wrote:

>> I have a solution using zfs set sharenfs=rw,nosuid zpool, but I prefer to use the sharemgr command.
> Then you prefer wrong. ZFS filesystems are not shared this way. Read up on ZFS and NFS.

It can also be done with sharemgr. Sharing via ZFS creates a sharemgr group called 'zfs', but you can also share things directly via the sharemgr commands. It is fairly well spelled out in the manpage: http://docs.sun.com/app/docs/doc/819-2240/sharemgr-1m?a=view

Basically you want to create a group, set the group's properties, and add a share to the group (see the sketch below). --Ware
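A minimal sketch of that sequence, with hypothetical group and path names (see the sharemgr(1M) manpage above for the full NFS property list):

sharemgr create -P nfs mygroup
sharemgr set -P nfs -p nosuid=true mygroup
sharemgr add-share -s /tank/export -r export mygroup
sharemgr show -vp mygroup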
Re: [zfs-discuss] sharemgr
dick hoogendijk wrote:

>> glidic anthony wrote: I have a solution using zfs set sharenfs=rw,nosuid zpool, but I prefer to use the sharemgr command.
> Then you prefer wrong.

To each their own.

> ZFS filesystems are not shared this way.

They can be. I do it all the time. There's nothing technical that dictates that sharemgr can't be used on ZFS filesystems. Just because ZFS provides an alternate way, that doesn't make it the only way, or even the 'one true way.' About the only advantage I can see of using zfs share is inheritance. If you don't need that, then sharemgr is just as good, and there are cases where it may be simpler. For instance, I loopback mount many, many ISOs and need to use sharemgr to share those anyway; I find it much more convenient to manage all my shares in one place with one tool. If sharemgr could (optionally) manage inherited sharing on ZFS filesystems, then I think it'd be cleaner to suggest that users use the one system-wide sharing tool, rather than one that only works for one filesystem. I can't remember them right now, but I think there are other commands where ZFS seems to have done the same thing, and I can't figure out why that's the trend. As great as ZFS is, it won't ever be the only filesystem around; ISOs (at least) will be around for a long time still. Why force users to learn new tools for each filesystem type?

> Read up on ZFS and NFS.

What makes you think he didn't? While the docs do describe how you can optionally use zfs share (which he clearly read about, since he mentioned it), they don't prohibit using sharemgr. I read his question as "How can I get sharemgr to set up sharing so that it gets inherited on child filesystems?" Apparently the answer to that question is "You can't." If you want to set it up only once you need zfs share, and if you really want to use sharemgr you need to share each filesystem separately. Maybe someday that will change. -Kyle
Re: [zfs-discuss] ZFS Random Read Performance
I posted baseline stats at http://www.ilk.org/~ppk/Geek/ The baseline test was 1 thread, 3 GiB file, 64 KiB to 512 KiB record size. 480-3511-baseline.xls is an iozone output file; iostat-baseline.txt is the iostat output for the device in use (annotated).

I also noted an odd behavior yesterday and have not had a chance to qualify it better. I was testing various combinations of vdev quantities and mirror quantities. As I changed the number of vdevs (stripes) from 1 through 8 (all backed by partitions on the same logical disk on the 3511) there was no real change in sequential write, random write, or random read performance. Sequential read performance did show a drop from 216 MiB/sec at 1 vdev to 180 MiB/sec at 8 vdevs. This was about as expected.

As I changed the number of mirror components things got interesting. Keep in mind that I only have one 3511 for testing right now; I had to use partitions from two other production 3511's to get three mirror components on different arrays. As expected, as I went from 1 to 2 to 3 mirror components the write performance did not change, but the read performance was interesting... see below:

                 read performance
mirrors    sequential        random
   1      174 MiB/sec.    23 MiB/sec.
   2      229 MiB/sec.    30 MiB/sec.
   3      223 MiB/sec.   125 MiB/sec.

What the heck happened here? Going from 1 to 2 mirrors saw a large increase in sequential read performance, and going from 2 to 3 mirrors showed a HUGE increase in random read performance. It feels like the behavior of the zfs code changed between 2 and 3 mirrors for the random read data.

Now to investigate further, I tried multiple mirror components on the same array (my test 3511); not that you would do this in production, but I was curious what would happen. In this case the throughput degraded across the board as I added mirror components, as one would expect. In the random read case the array was delivering less overall performance than it was when it was one part of the earlier test (16 MiB/sec. combined vs. 1/3 of 125 MiB/sec.). See sheet 7 of http://www.ilk.org/~ppk/Geek/throughput-summary.ods for these test results. Sheet 8 is the last test I did last night, using the NRAID logical disk type to try to get the 3511 to pass a disk through to zfs but keep the advantage of the cache on the 3511. I'm not sure what to read into those numbers.

-- Paul Kraus - Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) - Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) - Technical Advisor, Lunacon 2010 (http://www.lunacon.org/) - Technical Advisor, RPI Players
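For reference, the two kinds of layout being compared look like this (a sketch with hypothetical device names); ZFS can spread reads across all sides of a mirror while writes must hit every side, which is why read throughput can scale with mirror width while writes stay flat:

zpool create stripetest c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0    # 4 top-level vdevs (stripe)
zpool create mirrortest mirror c1t0d0s0 c2t0d0s0 c3t0d0s0      # one 3-way mirror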
Re: [zfs-discuss] heads-up: dedup=fletcher4,verify was broken
When will SXCE 129 be released, since 128 was passed over? There used to be a release calendar on opensolaris.org but I can't find it anymore.

Jeff Bonwick wrote:
> And, for the record, this is my fault. There is an aspect of endianness that I simply hadn't thought of. When I have a little more time I will blog about the whole thing, because there are many useful lessons here. Thank you, Matt, for all your help with this. And my apologies to everyone else for the disruption. Jeff
>
> On Mon, Nov 23, 2009 at 09:15:48PM -0800, Matthew Ahrens wrote:
>> We discovered another, more fundamental problem with dedup=fletcher4,verify. I've just putback the fix for:
>>
>>   6904243 zpool scrub/resilver doesn't work with cross-endian dedup=fletcher4,verify blocks
>>
>> The same instructions as below apply, but in addition, the dedup=fletcher4,verify functionality has been removed. We will investigate whether it's possible to fix these issues and re-enable this functionality. --matt
>>
>> Matthew Ahrens wrote:
>>> If you did not run "zfs set dedup=fletcher4,verify <fs>" (which is available in build 128 and nightly bits since then), you can ignore this message. We have changed the on-disk format of the pool when using dedup=fletcher4,verify with the integration of:
>>>
>>>   6903705 dedup=fletcher4,verify doesn't byteswap correctly, has lots of hash collisions
>>>
>>> This is not the default dedup setting; pools that only used "zfs set dedup=on" (or =sha256, or =verify, or =sha256,verify) are unaffected. Before installing bits with this fix, you will need to destroy any filesystems that have had dedup=fletcher4,verify set on them. You can preserve your existing data by running:
>>>
>>>   zfs set dedup=<any other setting> <old fs>
>>>   zfs snapshot -r <old fs>@snap
>>>   zfs create <new fs>
>>>   zfs send -R <old fs>@snap | zfs recv -d <new fs>
>>>   zfs destroy -r <old fs>
>>>
>>> Simply changing the setting from dedup=fletcher4,verify to another setting is not sufficient, as this does not modify existing data. You can verify that your pool isn't using dedup=fletcher4,verify by running:
>>>
>>>   zdb -D <pool> | grep DDT-fletcher4
>>>
>>> If there are no matches, your pool is not using dedup=fletcher4,verify, and it is safe to install bits with this fix. Build 128 will be respun to include this fix. Sorry for the inconvenience, -- team zfs
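To make the migration recipe concrete, a hypothetical run with placeholder names (tank/data is the affected filesystem, tank/data2 its replacement); the send/recv pass rewrites every block, which is what actually gets rid of the old fletcher4 DDT entries:

zfs set dedup=on tank/data
zfs snapshot -r tank/data@migrate
zfs create tank/data2
zfs send -R tank/data@migrate | zfs recv -d tank/data2
zfs destroy -r tank/data
zdb -D tank | grep DDT-fletcher4    # should eventually return nothing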
Re: [zfs-discuss] zfs-raidz - simulate disk failure
On Nov 24, 2009, at 2:51 PM, Daniel Carosone wrote:

> Those are great, but they're about testing the zfs software. There's a small amount of overlap, in that these injections include trying to simulate the hoped-for system response (e.g., EIO) to various physical scenarios, so it's worth looking at for scenario suggestions. However, for most of us, we generally rely on Sun's (generally acknowledged as excellent) testing of the software stack. I suspect the OP is more interested in verifying, on his own hardware, that physical events and problems will be connected to the software fault injection test scenarios. The rest of us running on random commodity hardware have largely the same interest, because Sun hasn't qualified the hardware parts of the stack as well. We've taken on that responsibility ourselves (both individually, and as a community by sharing findings).

Agree 110%.

> For example, for the various kinds of failures that might happen:
> * Does my particular drive/controller/chipset/bios/etc combination notice the problem and result in the appropriate error from the driver upwards?
> * How quickly does it notice? Do I have to wait for some long timeout or other retry cycle, and is that a problem for my usage?
> * Does the rest of the system keep working to allow zfs to recover/react, or is there some kind of follow-on failure (bus hangs/resets, etc.) that will have wider impact?
> Yanking disk controller and/or power cables is an easy and obvious test. Testing scenarios that involve things like disk firmware behaviour in response to bad reads is harder - though apparently yelling at them might be worthwhile :-)

The problem is that yanking a disk tests the failure mode of yanking a disk. If this is the sort of failure you expect to see, then perhaps you should look at a mechanical solution. If you wish to test the failure modes you are likely to see, then you need a more sophisticated test rig that will emulate a device and inject the sorts of faults you expect.

> Finding ways to dial up the load on your PSU (or drop voltage/limit current to a specific device with an inline filter) might be an idea, since overloaded power supplies seem to be implicated in various people's reports of trouble. Finding ways to generate EMF or cosmic rays to induce other kinds of failure is left as an exercise.

Many parts of the stack have software fault injection capabilities. Whether you do this with something like zinject or the wansimulator, the principle is the same. For example, you could easily add wansimulator to an iSCSI rig to inject packet corruption in the network. You can also roll your own with DTrace, which allows you to observe the return values of any function. The COMSTAR project has a test suite that could be leveraged, but it does not appear to be explicitly designed to perform system tests. I'm reasonably confident that the driver teams have test code too, but I would also expect it to be oriented towards unit testing. A quick search will turn up many fault injection software programs geared towards unit testing. Finally, there are companies that provide system-level test services. -- richard
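For the software side, a couple of starting points (sketches only; zinject ships with onnv bits and its options have varied across builds, so run zinject with no arguments first to check your build's usage):

zinject -d c7t0d1 -e io tank    # make I/O to one vdev return EIO
zinject                         # list active injection handlers
zinject -c all                  # cancel them

# DTrace: watch I/O errors propagate up through the block layer
dtrace -n 'io:::done /args[0]->b_error != 0/ { printf("errno %d on %s", args[0]->b_error, args[1]->dev_statname); }'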
Re: [zfs-discuss] sharemgr
On Wed, 2009-11-25 at 10:00 -0500, Kyle McDonald wrote:
> To each their own.
[cut the rest of your reply]

In general: I stand corrected. I was rude.
Re: [zfs-discuss] ZFS Random Read Performance
If you are using (3) 3511's, then isn't it possible that your 3 GB workload will be largely or entirely served out of RAID controller cache? Also, a question on your production backups (millions of small files): do you have atime=off set for the filesystems? That might be helpful.
Re: [zfs-discuss] heads-up: dedup=fletcher4,verify was broken
Maybe 11/30/2009? According to http://hub.opensolaris.org/bin/view/Community+Group+on/schedule we have:

onnv_129   11/23/2009   11/30/2009

But, as far as I know, those release dates are on a best-effort basis. Bruno

Karl Rossing wrote:
> When will SXCE 129 be released since 128 was passed over? There used to be a release calendar on opensolaris.org but I can't find it anymore.
> [...]
Re: [zfs-discuss] (home NAS) zfs and spinning down of drives
Jim sez:
> Like many others, I've come close to making a home NAS server based on ZFS and OpenSolaris. While this is not an enterprise solution with high IOPS expectation, but rather a low-power system for storing everything I have, I plan on cramming in some 6-10 5400 RPM Green drives with low wattage and high capacity, and possibly an SSD or two (or one-two spinning disks) for read/write caching/logging.

Hey! Me too! I'm up to buying new hardware to make it run. Having read through the thread, I wonder if the best solution might not be to make a minimal NAS-only box with a mirrored pair (or pairs) of drives for the daily updates, spinning this off at intervals via cron jobs or some such to long(er)-term, safer storage in a second system that's the main raidz repository. Sure, it's more elegant to have the momentary cache and safe repository on the same set of hardware, but for another $200 one can get a second whole system to work as the cache and take all the on/off cycles, then power on the main backing-store system only when something from deep-freeze storage is needed, while keeping the recent working set in the cache system. This lets you schedule (for cheap electricity) the operations of the deep-freeze backing storage, while keeping its disks mostly off and minimizing power cycles on the disks down to as little as 1/day. Elegance is nice, but there are some places where more hardware can take its place more quickly. Can you tell I'm at heart a hardware guy? 8-)
[zfs-discuss] Opensolaris with J4400 - Experiences
Hello! I'm currently using an X2200 with an LSI HBA connected to a Supermicro JBOD chassis; however, I want to have more redundancy in the JBOD. So I have looked into the market, and into the wallet, and I think the Sun J4400 suits my goals nicely. However, I have some concerns, and if anyone can give some suggestions I would truly appreciate it. And now for my questions:

* Will I be able to achieve multipath support if I connect the J4400 to 2 LSI HBAs in one server with SATA disks, or is this only possible with SAS disks? This server will have OpenSolaris (any release, I think).
* The CAM (StorageTek Common Array Manager) is only for hardware management of the JBOD, leaving disk/volume/zpool/LUN/whatever-name management up to the server operating system, correct?
* Can I put some readzillas/writezillas in the J4400 along with SATA disks, and if so will I have any benefit, or should I place those *zillas directly into the server's disk tray?
* Does anyone have experience with these JBODs? If so, are they in general solid/reliable?
* The server will probably be a Sun x44xx series with 32 GB RAM, but for the best possible performance, should I invest in more and more spindles, or a couple fewer spindles plus some readzillas?

This system will be mainly used to export some volumes over iSCSI to a Windows 2003 fileserver, and to hold some NFS shares. Thank you for all your time, Bruno
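On the multipath question, the usual OS-side check is roughly this (a sketch; it assumes the stock Solaris MPxIO stack and mpt-driven LSI HBAs):

stmsboot -D mpt -e      # enable MPxIO for mpt HBAs (requires a reboot)
mpathadm list lu        # afterwards, each LUN should show two operational paths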
Re: [zfs-discuss] ZFS Random Read Performance
On Wed, Nov 25, 2009 at 7:54 AM, Paul Kraus <pk1...@gmail.com> wrote:

>> You're peaking at 658 256KB random IOPS for the 3511, or ~66 IOPS per drive. Since ZFS will max out at 128KB per I/O, the disks see something more than 66 IOPS each. The IOPS data from iostat would be a better metric to observe than bandwidth. These drives are good for about 80 random IOPS each, so you may be close to disk saturation. The iostat data for IOPS and svc_t will confirm.
> But ... if I am saturating the 3511 with one thread, then why do I get many times that performance with multiple threads?

I'm having trouble making sense of the iostat data (I can't tell how many threads at any given point), but I do see lots of times where asvc_t * reads is in the range 850 ms to 950 ms. That is, this is as fast as a single-threaded app with a little bit of think time can issue reads (100 reads * 9 ms svc_t + 100 reads * 1 ms think_time = 1 sec). The %busy shows that 90+% of the time there is an I/O in flight (100 reads * 9 ms = 900/1000 = 90%). However, %busy isn't aware of how many I/Os could be in flight simultaneously. When you fire up more threads, you are able to have more I/Os in flight concurrently. I don't believe that the I/Os per drive is really a limiting factor in the single-threaded case, as the spec sheet for the 3511 says that it has 1 GB of cache per controller. Your working set is small enough that it is somewhat likely that many of those random reads will be served from cache. A dtrace analysis of just how random the reads are would be interesting. I think that hotspot.d from the DTrace Toolkit would be a good starting place. -- Mike Gerdts http://mgerdts.blogspot.com/
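Short of a full script, a one-liner like this gives a quick feel for how random the on-disk accesses really are (a sketch using the stable io provider; run as root):

dtrace -n 'io:::start { @blk[args[1]->dev_statname] = quantize(args[0]->b_blkno); }'

A histogram spread across the whole block range suggests truly random reads; tight clustering suggests the 3511's cache is likely absorbing much of the workload.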
Re: [zfs-discuss] ZFS Random Read Performance
more below... On Nov 25, 2009, at 5:54 AM, Paul Kraus wrote:

> Richard, first, thank you for the detailed reply ... (comments inline below)
> [...]
> The values reported by iozone are in pretty close agreement with what we are seeing with iostat during the test runs. Compression is off on zfs (the iozone test data compresses very well and yields bogus results). I am looking for a good alternative to iozone for random testing. [...]

filebench is usually bundled in /usr/benchmarks or as a pkg. vdbench is easy to use and very portable, www.vdbench.org

> Understood. I tried tuning the queue size to 50 and observed that the actv went to 50 (with very little difference in performance), so returned it to the default of 35.

Yep, the bottleneck is on the back end (physical HDDs). For arrays with lots of HDDs, this queue can be deeper, but the 3500 series is way too small to see this. If SSDs are used on the back end, then you can revisit this. From the data, it does look like the random read tests are converging on the media capabilities of the disks in the array. For the array you can see the read-modify-write penalty of RAID-5 as well as the caching and prefetching of reads. Note: the physical I/Os are 128 KB, regardless of the iozone size setting. This is expected, since 128 KB is the default recordsize limit for ZFS.

> What bothers me is that iostat is showing the 'disk' device as not being saturated during the random read test. I'll post iostat output that I captured yesterday to http://www.ilk.org/~ppk/Geek/ You can clearly see the various test phases (sequential write, rewrite, sequential read, reread, random read, then random write).

Is this a single thread? Usually this means that you aren't creating enough load. ZFS won't be prefetching (as much) for a random read workload, so iostat will expose client bottlenecks.

> [...]
Re: [zfs-discuss] ZFS Random Read Performance
more below... On Nov 25, 2009, at 7:10 AM, Paul Kraus wrote:

> I posted baseline stats at http://www.ilk.org/~ppk/Geek/
> [...]
> As expected, as I went from 1 to 2 to 3 mirror components the write performance did not change, but the read performance was interesting... see below:
>
>                  read performance
> mirrors    sequential        random
>    1      174 MiB/sec.    23 MiB/sec.
>    2      229 MiB/sec.    30 MiB/sec.
>    3      223 MiB/sec.   125 MiB/sec.
>
> What the heck happened here? Going from 1 to 2 mirrors saw a large increase in sequential read performance, and going from 2 to 3 mirrors showed a HUGE increase in random read performance. It feels like the behavior of the zfs code changed between 2 and 3 mirrors for the random read data.

I can't explain this. It may require a detailed understanding of the hardware configuration to identify the potential bottleneck. The ZFS mirroring code doesn't care how many mirrors there are; it just goes through the list. If the performance is not symmetrical from all sides of the mirror, then YMMV.

> Now to investigate further, I tried multiple mirror components on the same array (my test 3511)
> [...]
> Sheet 8 is the last test I did last night, using the NRAID logical disk type to try to get the 3511 to pass a disk through to zfs but keep the advantage of the cache on the 3511. I'm not sure what to read into those numbers.

I read it as: the single array, as configured, with 10+1 RAID-5, can deliver around 130 random read IOPS @ 128 KB. -- richard
[zfs-discuss] zfs: questions on ARC membership based on type/ordering of Reads/Writes
I am trying to understand the ARC's behavior based on different permutations of (a)sync Reads and (a)sync Writes. Thank you, in advance.

o Does the data for a *sync-write* *ever* go into the ARC? E.g., my understanding is that the data goes to the ZIL (and the SLOG, if present), but how does it get from the ZIL to the ZIO layer? E.g., does it go to the ARC on its way to the ZIO?

o If the sync-write data *does* go to the ARC, does it go to the ARC *after* it is written to the ZIL's backing store, or does the data go to the ZIL and the ARC in parallel?

o If a sync-write's data goes to the ARC and ZIL *in parallel*, then does zfs prevent an ARC hit until the data is confirmed to be on the ZIL's nonvolatile media (e.g., disk platter or SLOG)? Or could a Read get an ARC hit on a block *before* it's written to the ZIL's backing store?

o Is the DMU where the serialization of transactions occurs?

o If an async-Write for block X hits the serializer before a Read for block X hits the serializer, I am assuming the Read can pass the async-Write; i.e., the Read is *not* pended behind the async-Write. However, if a Read hits the serializer after a *sync*-write, then I'm assuming the Read is pended until the sync-write is written to the ZIL's nonvolatile media.

o If a Read passes an async-Write, then I'm assuming the Read can be satisfied by either the ARC, L2ARC, or disk.

o It's stated that the L2ARC is for random reads. However, there's nothing to prevent the L2ARC from containing blocks derived from *sequential* reads, right? Also, blocks from async-writes can also live in the L2ARC, right? How about sync-writes?

o Is the L2ARC literally simply a *larger* ARC? E.g., does the L2ARC obey the normal cache property where everything that is in the L1$ (the ARC) is also in the L2$ (the L2ARC)? (I have a feeling that the set-theoretic intersection of ARC and L2ARC is empty, for some reason.)

o Does the L2ARC use the ARC algorithm (as the name suggests)?

Thank you, /andrew
Solaris RPE
Re: [zfs-discuss] proposal partial/relative paths for zfs(1)
Is there still any interest in this? I've done a bit of hacking (then searched for this thread - I picked -P instead of -c)...

$ zfs get -P compression,dedup /var
NAME                PROPERTY     VALUE  SOURCE
rpool/ROOT/zfstest  compression  on     inherited from rpool/ROOT
rpool/ROOT/zfstest  dedup        off    default

$ pfexec zfs snapshot -P @now
Creating snapshot rpool/export/h...@now

Of course create/mkdir would make it into the eventual implementation as well. For those missing this thread in their mailboxes, the conversation is archived at http://mail.opensolaris.org/pipermail/zfs-discuss/2008-July/019762.html

Mike

On Thu, Jul 10, 2008 at 4:42 AM, Darren J Moffat <darren.mof...@sun.com> wrote:
> I regularly create new zfs filesystems or snapshots and I find it annoying that I have to type the full dataset name in all of those cases. I propose we allow zfs(1) to infer the part of the dataset name up to the current working directory. For example, today:
>
> $ zfs create cube/builds/darrenm/bugs/6724478
>
> With this proposal:
>
> $ pwd
> /cube/builds/darrenm/bugs
> $ zfs create 6724478
>
> Both of these would result in a new dataset cube/builds/darrenm/bugs/6724478. This will need some careful thought about how to deal with cases like this:
>
> $ pwd
> /cube/builds/
> $ zfs create 6724478/test
>
> What should that do? Should it create cube/builds/6724478 and cube/builds/6724478/test? Or should it fail? -p already provides some capabilities in this area. Maybe the easiest way out of the ambiguity is to add a flag to zfs create for the partial dataset name, e.g.:
>
> $ pwd
> /cube/builds/darrenm/bugs
> $ zfs create -c 6724478
>
> Why -c? -c for "current directory". -p ("partial") is already taken to mean create all non-existing parents, and -r ("relative") is already used consistently as "recurse" in other zfs(1) commands (as well as lots of other places). Alternately:
>
> $ pwd
> /cube/builds/darrenm/bugs
> $ zfs mkdir 6724478
>
> This would act like mkdir does (including allowing a -p and -m flag with the same meaning as mkdir(1)) but create datasets instead of directories. Thoughts? Is this useful for anyone else? My above examples are some of the shorter dataset names I use; ones in my home directory can be even deeper.
>
> -- Darren J Moffat

-- Mike Gerdts http://mgerdts.blogspot.com/
Re: [zfs-discuss] Best practices for zpools on zfs
On 2009-Nov-24 14:07:06 -0600, Mike Gerdts <mger...@gmail.com> wrote:
>> On Tue, Nov 24, 2009 at 1:39 PM, Richard Elling <richard.ell...@gmail.com> wrote:
>>> Also, the performance of /dev/*random is not very good. So prestaging lots of random data will be particularly challenging.

This depends on the random number generation algorithm used in the kernel. I get 50 MB/sec out of FreeBSD on a 3.2 GHz P4 (using Yarrow). In any case, you don't need crypto-grade random numbers, just data that is different and incompressible; there are lots of relatively simple RNGs that can deliver this with far greater speed.

> I was thinking that a bignum library such as libgmp could be handy to allow easy bit shifting of large amounts of data. That is, fill a 128 KB buffer with random data then do bitwise rotations for each successive use of the buffer. Unless my math is wrong, it should allow 128 KB of random data to write 128 GB of data with very little deduplication or compression. A much larger data set could be generated with the use of a 128 KB linear feedback shift register...

This strikes me as much harder to use than just filling the buffer with 8/32/64-bit random numbers from a linear congruential generator, lagged Fibonacci generator, Mersenne twister, or even random(3). http://en.wikipedia.org/wiki/List_of_random_number_generators

-- Peter Jeremy
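A common shell-level trick for pre-staging incompressible, non-deduplicating data quickly, without touching /dev/*random (a sketch; assumes an OpenSSL build with a CTR-mode cipher available):

openssl enc -aes-128-ctr -pass pass:myseed -nosalt < /dev/zero | dd of=/tank/fs/testfile bs=128k count=8192

Encrypting /dev/zero with a keyed stream cipher produces output at bulk-cipher speed that neither compresses nor dedups, which is all this use case needs.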
Re: [zfs-discuss] zfs: questions on ARC membership based on type/ordering of Reads/Writes
On Nov 25, 2009, at 11:55 AM, andrew.r...@sun.com wrote:
> I am trying to understand the ARC's behavior based on different permutations of (a)sync Reads and (a)sync Writes. thank you, in advance
> o does the data for a *sync-write* *ever* go into the ARC?

always

> eg, my understanding is that the data goes to the ZIL (and the SLOG, if present), but how does it get from the ZIL to the ZIO layer?

ZIL is effectively write-only. It is only read when the pool is imported.

> eg, does it go to the ARC on its way to the ZIO?

ARC is the cache for buffering data.

> o if the sync-write data *does* go to the ARC, does it go to the ARC *after* it is written to the ZIL's backing store, or does the data go to the ZIL and the ARC in parallel?

A sync write returns when the data is written to the ZIL. An async write returns when the data is in the ARC, and later the unwritten contents of the ARC are pushed to the pool when the transaction group is committed.

> o if a sync-write's data goes to the ARC and ZIL *in parallel*, then does zfs prevent an ARC hit until the data is confirmed to be on the ZIL's nonvolatile media (eg, disk platter or SLOG)? or could a Read get an ARC hit on a block *before* it's written to the ZIL's backing store?

In my mind, the ARC and ZIL are orthogonal.

> o is the DMU where the serialization of transactions occurs?

Serialization?

> o if an async-Write for block X hits the serializer before a Read for block X hits the serializer, i am assuming the Read can pass the async-Write; eg, the Read is *not* pended behind the async-write. however, if a Read hits the serializer after a *sync*-write, then i'm assuming the Read is pended until the sync-write is written to the ZIL's nonvolatile media.
> o if a Read passes an async-write, then i'm assuming the Read can be satisfied by either the ARC, L2ARC, or disk.

I think you are asking if write order is preserved. The answer is yes.

> o it's stated that the L2ARC is for random reads. however, there's nothing to prevent the L2ARC from containing blocks derived from *sequential* reads, right? also, blocks from async-writes can also live in the L2ARC, right? how about sync-writes?

Blocks which are not yet committed to the pool are locked in the ARC so they can't be evicted. Once committed, the lock is removed.

> o is the L2ARC literally simply a *larger* ARC? eg, does the L2ARC obey the normal cache property where everything that is in the L1$ (the ARC) is also in the L2$ (the L2ARC)?

No. The L2ARC is not in the datapath between the ARC and media. Further, data is not evicted from the ARC into the L2ARC. Rather, the L2ARC is filled from data near the eviction ends of the MRU and MFU lists. The movement of data to the L2ARC is throttled and grouped in sequence, improving efficiency for devices which like large writes, such as read-optimized flash.

Think of it this way: data which is in the ARC is fed into the L2ARC. If the data is later evicted from the ARC, it can still live in the L2ARC. When the L2ARC has lower read latency than the pool's media, it can improve performance because the data can be read from the L2ARC instead of the pool. This fits the general definition of a cache, but does not work the same way as multilevel CPU caches.

> o does the L2ARC use the ARC algorithm (as the name suggests)?

Yes, but it really isn't separate from the ARC, from a management point of view.
To fully understand it, you need to know how the metadata for each buffer in the ARC is managed. This will introduce the concept of the ghosts, and the L2ARC is a simple extension. The comments in the source are nicely descriptive, and you might consider reading them through once, even if you don't dive into the code itself: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c -- richard
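A quick way to watch the ARC/L2ARC relationship Richard describes is the arcstats kstats (a sketch; field names as of this era's bits):

kstat -p zfs:0:arcstats:size zfs:0:arcstats:l2_size
kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses zfs:0:arcstats:l2_hits

Because the L2ARC is fed from, rather than strictly contained by, the ARC, l2_size can exceed the ARC size, something an inclusive CPU-style L2 could never show.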
Re: [zfs-discuss] zfs-raidz - simulate disk failure
> > [verify on real hardware and share results]
> Agree 110%.

Good :)

> > Yanking disk controller and/or power cables is an easy and obvious test.
> The problem is that yanking a disk tests the failure mode of yanking a disk.

Yes, but the point is that it's a cheap and easy test, so you might as well do it; just beware of what it does, and most importantly does not, tell you. It's a valid scenario to test regardless: you want to be sure that you can yank a disk to replace it, without a bus hang or other hotplug problem on your hardware.

> > Testing scenarios that involve things like disk firmware behaviour in response to bad reads is harder -
> If you wish to test the failure modes you are likely to see, then you need a more sophisticated test rig that will emulate a device and inject the sorts of faults you expect.

This is one reason I like to keep faulty disks! :)
Re: [zfs-discuss] zfs-raidz - simulate disk failure
On Nov 25, 2009, at 4:43 PM, Daniel Carosone wrote:
> [...]
> It's a valid scenario to test regardless: you want to be sure that you can yank a disk to replace it, without a bus hang or other hotplug problem on your hardware.

The next problem is that although a spec might say that hot-plugging works, that doesn't mean the implementers support it. To wit, there are well-known SATA controllers that do not support hot plug. So what good is the test if the hardware/firmware is known to not support it? Speaking practically, do you evaluate your chipset and disks for hotplug support before you buy?

> This is one reason I like to keep faulty disks! :)

Me too. I still have a SATA drive that breaks POST on every mobo I've come across. Wanna try hot plug with it? :-) -- richard
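Where the driver stack does support it, there is a softer variant of the yank test that exercises the hotplug path without touching the hardware (a sketch; attachment-point names vary per system, so check cfgadm -al first):

cfgadm -al                       # list attachment points and their occupants
cfgadm -c unconfigure sata0/3    # offline one disk's attachment point
cfgadm -c configure sata0/3      # bring it back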
Re: [zfs-discuss] zfs-raidz - simulate disk failure
On Wed, Nov 25 at 16:43, Daniel Carosone wrote:
> > The problem is that yanking a disk tests the failure mode of yanking a disk.
> Yes, but the point is that it's a cheap and easy test, so you might as well do it; just beware of what it does, and most importantly does not, tell you. It's a valid scenario to test regardless: you want to be sure that you can yank a disk to replace it, without a bus hang or other hotplug problem on your hardware.

Agreed. It's also a very effective way of preventing your drive from responding to commands, to test how the system behaves when a drive stops responding. Some significant percentage of device failures will look similar. --eric

-- Eric D. Mudama, edmud...@mail.bounceswoosh.org
Re: [zfs-discuss] Heads up: SUNWzfs-auto-snapshot obsoletion in snv 128
> So we also need a txg dirty or similar property to be exposed from the kernel.

Or not... if you find this condition, defer, but check again in a minute (really, after a full txg interval has passed) rather than at the next scheduled snapshot. On that next check, if the txg has advanced again, snapshot; if not, defer until the next scheduled snapshot as usual. Yes, the txg may now be dirty this second time around, but it's after the snapshot was due, so those writes will be collected in the next snapshot.
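In cron-script form, the core compare-and-skip step might look something like this (a sketch; it leans on zdb -u's "txg = N" output line, which is not a stable interface):

#!/bin/ksh
# skip the snapshot if the pool's txg has not advanced since last time
cur=$(zdb -u tank | awk '$1 == "txg" { print $3 }')
last=$(cat /var/run/last_snap_txg 2>/dev/null)
if [ "$cur" != "$last" ]; then
    zfs snapshot -r tank@auto-$(date +%Y%m%d-%H%M)
    echo "$cur" > /var/run/last_snap_txg
fi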
Re: [zfs-discuss] zfs-raidz - simulate disk failure
> Speaking practically, do you evaluate your chipset and disks for hotplug support before you buy?

Yes, if someone else has shared their test results previously.
Re: [zfs-discuss] X45xx storage vs 7xxx Unified storage
>>>>> "et" == Erik Trimble <erik.trim...@sun.com> writes:

    et> I'd still get the 7310 hardware.
    et> Worst case scenario is that you can blow away the AmberRoad

okay but, AIUI, he was saying pricing is 6% more for half as much physical disk. This is also why it ``uses less energy'' while supposedly filling the same role: fishworks clustering is based on SAS multi-initiator, on SAS fan...uh,...fan-in? switches, while the OP's home-rolled cluster plan was based on copying the data to another zpool. Remember, pricing is based on ``market forces'': it's not dumb, it is the opposite of dumb, but under ``market forces'' pricing, if you are paying for clever schemes you can't use, YHL.
Re: [zfs-discuss] X45xx storage vs 7xxx Unified storage
Miles Nordin wrote:
> okay but, AIUI, he was saying pricing is 6% more for half as much physical disk.
> [...]

No, 6% LESS for the 7310 solution vs. the dual X4540 solution. The key here is usable disk space. Yes, the X4540 comes with 2x the disk space, but having to cluster them via non-shared storage, you effectively eliminate that advantage. Not to mention that expanding a clustered X4540 either means you have to buy 2x the required storage (i.e., attach another array to each X4540), or you do the exact same thing as with a 7310 (i.e., dual-attach an array to both).

You certainly are paying some premium for the A-R software; however, I was stating the worst-case scenario where he finds he can't make use of the A-R software. He's still left with a hardware solution that is superior to the dual X4540 (in my opinion). That is, software aside, my opinion is that a clustered X4140 with shared J4400 chassis is a better idea than a redundant X4540 setup, with or without the A-R software. The A-R software just makes the configuration of the 7310 extremely simple, which is no small win in and of itself.

-- Erik Trimble, Java System Support, Santa Clara, CA
Re: [zfs-discuss] Replacing log with SSD on Sol10 u8
Interesting. Unfortunately, I can not zpool offline, nor zpool detach, nor zpool remove the existing c6t4d0s0 device. I thought perhaps we could boot something newer than b125 [*1] and I would be able to remove the slog device that is too big. The dev-127.iso does not boot [*2] due to splashimage, so I had to edit the ISO to remove that for booting. After booting with -B console=ttya, I found that it could not add the /dev/dsk entries for the 24 HDDs, since / is on a too-small ramdisk. Disk-full messages ensue. Yay!

After I finally imported the pools, without upgrading (since I have to boot back to Sol 10 u8 for production), I attempted to remove the slog that is no longer needed:

# zpool remove zpool1 c6t4d0s0
cannot remove c6t4d0s0: pool must be upgraded to support log removal

Sigh. Lund

[*1] http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6574286
[*2] http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6739497

-- Jorgen Lundman | lund...@lundman.net | Unix Administrator | Shibuya-ku, Tokyo, Japan
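Slog removal is tied to a specific pool version (19, which arrived around build 125), so a sanity check is worth doing before booting newer bits (a sketch; confirm the exact version text with zpool upgrade -v on the bits you boot):

zpool get version zpool1                           # the pool's current on-disk version
zpool upgrade -v | grep -i "log device removal"    # the version that adds slog removal

Note the catch hit here: newer bits can import the old pool, but zpool remove on a slog requires upgrading the pool itself, and an upgraded pool would no longer import on Solaris 10 u8, which supports an older pool version.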