Re: [zfs-discuss] Replacement for X25-E
I don't think the 311 has any over-provisioning (other than the 7% from the GB to GiB conversion). I believe it is an X25-E with only 5 channels populated. The upcoming enterprise models are MLC based and have greater over-provisioning AFAIK. The 20GB 311 only costs ~ $100 though. The 100GB Intel 710 costs ~ $650. The 311 is a good choice for home or budget users, and it seems that the 710 is much bigger than it needs to be for slog devices. I think the 311 looks like a suitable replacement, in the sense that for the price of two X25-Es you can put in four 311's as slog; going to test it out. Thanks to you all. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
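A minimal sketch of what putting four 311s in as slog could look like (device names here are made up; whether you mirror or stripe the log devices is a separate trade-off):
# zpool add tank log mirror c4t0d0 c4t1d0
# zpool add tank log mirror c4t2d0 c4t3d0
# zpool status tank
Mirroring the log devices in pairs keeps the slog redundant; adding them without the mirror keyword would instead stripe synchronous writes across all four devices.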
Re: [zfs-discuss] Replacement for X25-E
Can you rank your priorities:
+ cost/IOPS
+ cost
+ latency
+ predictable latency
+ HA-cluster capable
There are quite a number of devices available now, at widely varying costs, applications, and performance. -- richard
I'd say a price range around the same as the X25-E was, the main priorities being predictable latency and performance. Also, write wear shouldn't become an issue when writing 150MB/s 24/7, 365 days a year. Thanks Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
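As a rough back-of-envelope check of that write load (my own arithmetic, not figures from the thread):
150 MB/s * 86400 s/day = 12,960,000 MB/day, roughly 12.4 TiB written per day
12.4 TiB/day * 365 days is roughly 4.4 PiB per year
That yearly figure (divided across however many log devices share the load) is the number to compare against a candidate drive's rated write endurance.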
[zfs-discuss] Replacement for X25-E
Hi, I was wondering, do you guys have any recommendations for a replacement for the Intel X25-E, as it is being EOL'd? Mainly for use as a log device. With kind regards Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] about write balancing
To me it seems that writes are not directed properly to the devices that have most free space - almost exactly the opposite. The writes seem to go to the devices that have _least_ free space, instead of the devices that have most free space. The same effect that can be seen in these 60s averages can also be observed in a shorter timespan, like a second or so. Is there something obvious I'm missing? Not sure how OI should behave; I've managed to even out space usage between vdevs by bringing a device offline in the vdev you don't want writes to end up in. If you have a degraded vdev in your pool, zfs will try not to write there, and this may be the case here as well, since I don't see zpool status output. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
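A minimal sketch of the offline trick described above (pool and device names are hypothetical; remember that an offlined device leaves its vdev degraded and unredundant for the duration):
# zpool offline tank c3t5d0    # steer new writes away from this vdev
# ... let new writes land on the emptier vdevs ...
# zpool online tank c3t5d0     # the device resilvers the writes it missed
# zpool status tank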
Re: [zfs-discuss] Wired write performance problem
Have you determined this is not 7000208? It sounds much like it. You could run
/usr/sbin/lockstat -HcwP -n 10 -x aggrate=10hz -D 20 -s 40 sleep 2
/usr/sbin/lockstat -CcwP -n 10 -x aggrate=10hz -D 20 -s 40 sleep 2
to find out the hottest callers (space_map_load, kmem_cache_free) while the issue is occurring. Yours Markus Kovero
-Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Donald Stahl Sent: 9 June 2011 6:27 To: Ding Honghui Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Wired write performance problem
There is a snapshot of the metaslab layout; the last 51 metaslabs have 64G free space. After we added all the disks to our system we had lots of free metaslabs - but that didn't seem to matter. I don't know if perhaps the system was attempting to balance the writes across more of our devices, but whatever the reason - the percentage didn't seem to matter. All that mattered was changing the size of the min_alloc tunable. You seem to have gotten a lot deeper into some of this analysis than I did, so I'm not sure if I can really add anything. Since 10u8 doesn't support that tunable I'm not really sure where to go from there. If you can take the pool offline, you might try connecting it to a b148 box and see if that tunable makes a difference. Beyond that I don't really have any suggestions. Your problem description, including the return of performance when freeing space, is _identical_ to the problem we had. After checking every single piece of hardware, replacing countless pieces, removing COMSTAR and other pieces from the puzzle - the only change that helped was changing that tunable. I wish I could be of more help but I have not had the time to dive into the ZFS code with any gusto. -Don ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wired write performance problem
Hi, also see; http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg45408.html We hit this with Sol11 though, not sure if it's possible with Sol10. Yours Markus Kovero
-Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ding Honghui Sent: 8 June 2011 6:07 To: zfs-discuss@opensolaris.org Subject: [zfs-discuss] Wired write performance problem
Hi, I've got a weird write performance problem and need your help. One day, the write performance of zfs degraded. The write performance decreased from 60MB/s to about 6MB/s for sequential writes. Command: date;dd if=/dev/zero of=block bs=1024*128 count=1;date
The hardware configuration is 1 Dell MD3000 and 1 MD1000 with 30 disks. The OS is Solaris 10U8, zpool version 15 and zfs version 4. I ran DTrace to trace the write performance:
fbt:zfs:zfs_write:entry { self->ts = timestamp; }
fbt:zfs:zfs_write:return /self->ts/ { @time = quantize(timestamp - self->ts); self->ts = 0; }
It shows the following distribution (value / count):
8192 0
16384 16
32768 3270
65536 898
131072 985
262144 33
524288 1
1048576 1
2097152 3
4194304 0
8388608 180
16777216 33
33554432 0
67108864 0
134217728 0
268435456 1
536870912 1
1073741824 2
2147483648 0
4294967296 0
8589934592 0
17179869184 2
34359738368 3
68719476736 0
Compared to a storage system that is working well (1 MD3000), where the max zfs_write time falls in the 4294967296 bucket, that system is about 10 times faster. Any suggestions? Thanks Ding ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
failed: space_map_load(sm, zfs_metaslab_ops, SM_FREE, smo, spa->spa_meta_objset) == 0, file ../zdb.c, line 571, function dump_metaslab Is this something I should worry about? uname -a: SunOS E55000 5.11 oi_148 i86pc i386 i86pc Solaris I thought we were talking about Solaris 11 Express, not OI? Anyway, I have no idea how OpenIndiana should or shouldn't behave here. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express
Sync was disabled on the main pool and then left to inherit to everything else. The reason for disabling this in the first place was to fix bad NFS write performance (even with a ZIL on an X25-E SSD it was under 1MB/s). I've also tried setting the logbias to throughput and latency but they both perform around the same level. Thanks -Matt
I believe you're hitting bug 7000208: space map thrashing affects NFS write throughput. We did too, and it impacted iscsi as well. If you have enough RAM you can try enabling metaslab debug (which makes the problem vanish):
# echo metaslab_debug/W1 | mdb -kw
And calculating the amount of RAM needed:
# /usr/sbin/amd64/zdb -mm poolname > /tmp/zdb-mm.out
# awk '/segments/ {s+=$2} END {printf("sum=%d\n", s)}' /tmp/zdb-mm.out
93373117 = sum of segments
16 VDEVs * 116 metaslabs = 1856 metaslabs in total
93373117 / 1856 = 50308 average number of segments per metaslab
50308 * 1856 * 64 = 5975785472
5975785472 / 1024 / 1024 / 1024 = 5.56 GB
Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What drives?
So, does anyone know which drives to choose for the next setup? Hitachis look good so far, perhaps also Seagates, but right now, I'm dubious about the Blacks. Hi! I'd go for the WD RE edition. Blacks and Greens are for desktop use and therefore lack proper TLER settings, and they have useless power saving features that could induce errors and mysterious slowness. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Very bad ZFS write performance. Ok Read.
I noticed recently that write rate has dropped off and through testing now I am getting 35MB/sec writes. The pool is around 50-60% full. I am getting a CONSTANT 30-35% kernel cpu utilisation, even if the machine is idle. I do not know if this was the case when the write performance was better. I have tried reading from the server to a HDD on a windows client and I get 50+MB/sec which is probably the max that that HDD can sustain on a write. Hi, do you have your zfs prefetch turned on or off? Turning prefetch off makes comstar iscsi shares unusable in Solaris 11 Express while it might work fine in osol. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
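If it helps with the question above, this is one way to check and flip the prefetch tunable on a live system (a sketch using the usual mdb idiom seen elsewhere in this list; 1 disables prefetch, 0 re-enables it):
# echo zfs_prefetch_disable/D | mdb -k      # show current value
# echo zfs_prefetch_disable/W0t1 | mdb -kw  # disable prefetch
# echo zfs_prefetch_disable/W0t0 | mdb -kw  # enable prefetch again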
Re: [zfs-discuss] Very bad ZFS write performance. Ok Read.
On the other hand, that will only matter for reads. And the complaint is writes. Actually, it also affects writes (due to checksum reads?). Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool does not like iSCSI ?
Do you know if these bugs are fixed in Solaris 11 Express? It says it was fixed in snv_140, and S11E is based on snv_151a, so it should be in: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6907687 I can confirm it works; iscsi zpools seem to work very happily now. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAID-Z/mirror hybrid allocator
2. If you have an existing RAIDZ pool and upgrade to b151a, you would need to upgrade the pool version to use this feature. In this case, newly written metadata would be mirrored. Hi, And if one creates a raidz3 pool, would metadata be a 3-way mirror as well? Also, how is it determined onto which devices the metadata is mirrored? Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] problem adding second MD1000 enclosure to LSI 9200-16e
Any suggestions? Is there some sort of boot procedure, in order to get the system to recognize the second enclosure without locking up? Is there a special way to configure one of these LSI boards? It should just work; make sure you connect it the right way and that neither JBOD is in split mode (which does not allow daisy chaining). Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] RAID-Z/mirror hybrid allocator
Hi, I'm referring to; http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6977913 It should be in Solaris 11 Express, has anyone tried this? How is this supposed to work? Any documentation available? Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
Does Oracle support Solaris 11 Express in production systems? -- richard Yes, you need a Premier support plan from Oracle for that. AFAIK, Solaris 11 Express is production ready, is going to be updated to the real Solaris 11, and is supported even on non-Oracle hardware if you have the money (and a certified system). Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] is opensolaris support ended?
Thanks for your help. I would check this out. Hi, yes. No new support plans have been available for a while. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Maximum zfs send/receive throughput
I'm wondering if #6975124 could be the cause of my problem, too. There are several zfs send (and receive) related issues with 111b. You might seriously want to consider upgrading to a more recent OpenSolaris (134) or OpenIndiana. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool does not like iSCSI ?
Again: you may encounter this case only if... ... you're running some recent kernel patch level (in our case 142909-17) ... *and* you have placed zpools on both iscsi and non-iscsi devices. I witnessed the same behavior with osol_134, but it seems to be fixed in 147 at least. No idea about Solaris though. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Running on Dell hardware?
Add about 50% to the last price list from Sun and you will get the price it costs now ... It seems Oracle does not really want to sell its hardware: several-month delays getting prices out of sales reps, and pricing nowhere close to its competitors'. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Running on Dell hardware?
That's precisely what I'm experiencing. System still responds to ping. Anything that was already running in memory via network stays alive (cron jobs continue to run) but remote access is impossible (ssh, vnc, even local physical console...) And eventually the system will stop completely. Hi, the Broadcom issues show up as loss of network connectivity, i.e. the system stops responding to ping. This is a different issue; it's as if the system runs out of memory or loses its system disks (which we have seen lately). Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Running on Dell hardware?
You are asking for a world of hurt. You may luck out, and it may work great, thus saving you money. Take my example for example ... I took the safe approach (as far as any non-sun hardware is concerned.) I bought an officially supported dell server, with all dell blessed and solaris supported components, with support contracts on both the hardware and software, fully patched and updated on all fronts, and I am getting system failures approx once per week. I have support tickets open with both dell and oracle right now ... Have no idea how it's all going to turn out. But if you have a problem like mine, using unsupported hardware, you have no alternative. You're up a tree full of bees, naked, with a hunter on the ground trying to shoot you. And IMHO, I think the probability of having a problem like mine is higher when you use the unsupported hardware. But of course there's no definable way to quantify that belief. My advice to you is: buy the supported hardware, and the support contracts for both the hardware and software. But of course, that's all just a calculated risk, and I doubt you're going to take my advice. ;-) Are there any other feasible alternatives to Dell hardware? I'm wondering whether these issues are mostly related to Nehalem architectural problems, e.g. C-states. If so, is there anything to gain by switching hardware vendor? HP anyone? Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Running on Dell hardware?
I have a Dell R710 which has been flaky for some time. It crashes about once per week. I have literally replaced every piece of hardware in it, and reinstalled Sol 10u9 fresh and clean. I am wondering if other people out there are using Dell hardware, with what degree of success, and in what configuration? The failure seems to be related to the perc 6i. For some period around the time of crash, the system still responds to ping, and anything currently in memory or running from remote storage continues to function fine. But new processes that require the local storage ... Such as inbound ssh etc, or even physical login at the console ... those are all hosed. And eventually the system stops responding to ping. As soon as the problem starts, the only recourse is power cycle. I can't seem to reproduce the problem reliably, but it does happen regularly. Yesterday it happened several times in one day, but sometimes it will go 2 weeks without a problem. Again, just wondering what other people are using, and experiencing. To see if any more clues can be found to identify the cause. Hi, we've been running opensolaris on Dell R710s with mixed results; some work better than others, and we've been struggling with the same issue as you with the latest servers. I suspect some kind of power-saving issue gone wrong: the system disks go to sleep and never wake up, or something similar. Personally, I cannot recommend using them with Solaris; support is not even close to what it should be. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Running on Dell hardware?
How consistent are your problems? If you change something and things get better or worse, will you be able to notice? Right now, I think I have improved matters by changing the Perc to WriteThrough instead of WriteBack. Yesterday the system crashed several times before I changed that, and afterward, I can't get it to crash at all. But as I said before ... Sometimes the system goes 2 weeks without a problem. Do you have all your disks configured as individual disks? Do you have any SSD? WriteBack or WriteThrough? I believe the issues are not related to the PERC, as we use a SAS 6/iR with the system disks, and the disks show up as individual disks. The system has been crashing with and without (I/O) load; so far it's been running best with all extra PCIe cards removed (10Gbps NIC, SAS 5/E controllers), uptime almost two days. There's no apparent trigger for the crash; it crashed very frequently during one day and now it seems more stable. (sunspots anyone?) We had SSDs at the start, but removed them during testing, no effect there. Somehow, all this is starting to remind me of the Broadcom NIC issues. A different (not fully supported) hardware revision causing issues? Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedup status
Hi! Hi all I just tested dedup on this test box running OpenIndiana (147) storing bacula backups, and did some more testing on some datasets with ISO images. The results so far show that removing 30GB deduped datasets is done in a matter of minutes, which is not the case with 134 (which may take hours). The tests also show that the write speed to the pool is low, very low, if dedup is enabled. This is a box with a 3GHz core2duo, 8 gigs of RAM, eight 2TB drives and an 80GB x25-m for the SLOG (4 gigs) and L2ARC (the rest of it). So far I conclude that dedup should be useful if storage capacity is crucial, but not if performance is taken into consideration. Mind, this is not a high-end box, but still, I think the numbers show something. Hi, it is probably because you have quite a low amount of RAM. I have a similar setup, a 10TB dataset that can handle 100MB/s writes easily; the system has 24GB of RAM. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pools inside pools
Yes. But what is enough reserved free memory? If you need 1Mb for a normal configuration you might need 2Mb when you are doing ZFS on ZFS. (I am just guessing). This is the same problem as mounting an NFS server on itself via NFS. Also not supported. The system has shrinkable caches and so on, but that space will sometimes run out. All of it. There is also swap to use, but if that is on ZFS... These things are also very hard to test. I was able to watch opensolaris snv_134 become unresponsive due to lack of memory with a nested pool configuration today. It took around 12 hours of issuing writes at around 1.2-1.5GB/s on a system that had 48GB of RAM. Anyway, setting zfs_arc_max in /etc/system seemed to do the trick; it seems to behave as expected even under heavier load. Performance is actually pretty good. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
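For reference, capping the ARC as described above is normally done with a line like the following in /etc/system, followed by a reboot (the 16 GiB value here is only an illustrative figure, not the one used on that box):
set zfs:zfs_arc_max = 0x400000000
0x400000000 is 16 GiB; pick a value that leaves enough memory for the inner pool's own caching and for applications.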
Re: [zfs-discuss] dedup testing?
On Sat, Sep 25, 2010 at 10:19 AM, Piotr Jasiukajtis est...@gmail.com wrote: AFAIK that part of dedup code is not changed in b147. I think I remember seeing that there was a change made in 142 that helps, though I'm not sure to what extent. -B OI seemed to behave much better than 134 in a low disk space situation with dedup turned on, after the server crashed during a (terabytes of data) snapshot destroy. The import took some time, but it did not block IO, and the most time consuming part was mounting datasets; already mounted datasets could be used during the import too. Also, performance is a lot better. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pools inside pools
Isn't this a matter of not keeping enough free memory as a workspace? By free memory, I am referring to unallocated memory and also recoverable main memory used for shrinkable read caches (shrinkable by discarding cached data). If the system keeps enough free and recoverable memory around for workspace, why should the deadlock case ever arise? Slowness and page swapping might be expected to arise (as a result of a shrinking read cache and high memory pressure), but deadlocks too? It sounds like deadlocks from the described scenario indicate the memory allocation and caching algorithms do not perform gracefully in the face of high memory pressure. If the deadlocks do not occur when different memory pools are involved (by using a second computer), that tells me that memory allocation decisions are playing a role. Additional data should not be accepted for writes when the system determines memory pressure is so high that it may not be able to flush everything to disk. Here is one article about memory pressure (on Windows, but the issues apply cross-OS): http://blogs.msdn.com/b/slavao/archive/2005/02/01/364523.aspx (How does virtualization fit into this picture? If both OpenSolaris systems are actually running inside of different virtual machines, on top of the same host, have we isolated them enough to allow pools inside pools without risk of deadlocks? ) I haven't noticed any deadlock issues so far in low memory conditions when doing nested pools (in a replicated configuration), at least in snv_134. Maybe I haven't tried hard enough; anyway, wouldn't a log device in the inner pool help in this situation? Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pools inside pools
What is an example of where a checksummed outside pool would not be able to protect a non-checksummed inside pool? Would an intermittent RAM/motherboard/CPU failure that only corrupted the inner pool's block before it was passed to the outer pool (and did not corrupt the outer pool's block) be a valid example? If checksums are desirable in this scenario, then redundancy would also be needed to recover from checksum failures. That is an excellent point also: what is the point of checksumming if you cannot recover from it? In this kind of configuration one would benefit performance-wise from not having to calculate checksums again. Checksums in the outer pool effectively protect from disk issues; if hardware fails so that data is corrupted, isn't the outer pool's redundancy going to handle it for the inner pool as well? The only thing that comes to mind is that IF something happens to the outer pool, the inner pool is no longer aware of possibly broken data, which can lead to issues. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Pools inside pools
Hi, I'm asking for opinions here: any possible disasters or performance issues related to the setup described below? The point is to create a large pool and smaller pools within it, where you can easily monitor iops and bandwidth usage without using dtrace or similar techniques.
1. Create a pool:
# zpool create testpool mirror c1t1d0 c1t2d0
2. Create a volume inside the pool we just created:
# zfs create -V 500g testpool/testvolume
3. Create a pool from the volume we just made:
# zpool create anotherpool /dev/zvol/dsk/testpool/testvolume
After this, anotherpool can be monitored nicely via zpool iostat, and compression can be used in testpool to save resources without compression having an effect in anotherpool. zpool export/import seems to work, although the -d flag needs to be used. Are there any caveats in this setup? How are writes handled? Is it safe to create a pool consisting of several SSDs and use volumes from it as log devices? Is it even supported? Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
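As a sketch of the export/import step mentioned above (same hypothetical pool names as in the recipe): because the inner pool lives on a zvol rather than on devices under /dev/dsk, the import has to be pointed at the zvol directory with -d:
# zpool export anotherpool
# zpool import -d /dev/zvol/dsk/testpool anotherpool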
Re: [zfs-discuss] Pools inside pools
Such configuration was known to cause deadlocks. Even if it works now (which I don't expect to be the case) it will make your data be cached twice. The CPU utilization will also be much higher, etc. All in all I strongly recommend against such setup. -- Pawel Jakub Dawidek http://www.wheelsystems.com p...@freebsd.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! Well, CPU utilization can be tuned down by disabling checksums in the inner pools, as checksumming is done in the main pool. I'd be interested in bug IDs for the deadlock issues and anything related. Caching twice is not an issue; prefetching could be, and it can be disabled. I don't understand what makes it difficult for zfs to handle this kind of setup. The main pool (testpool) should just allow any writes/reads to/from the volume, not caring what they are or where they go, whereas anotherpool would just work like any other pool consisting of any other devices. This is quite a similar setup to an iscsi-replicated mirror pool, where you have a redundant pool created from iscsi volumes locally and remotely. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pools inside pools
Actually, the mechanics of local pools inside pools is significantly different than using remote volumes (potentially exported ZFS volumes) to build a local pool from. I don't see how; I'm referring to a method where hostA shares a local iscsi volume to hostB, where the volume is mirrored with zfs against its local volume that is shared back through iscsi, resulting in a synchronously mirrored pool. And, no, you WOULDN'T want to turn off the inside pool's checksums. You're assuming that this would be taken care of by the outside pool, but that's a faulty assumption, since the only way this would happen would be if the pools somehow understood they were being nested, and thus could bypass much of the caching and I/O infrastructure related to the inner pool. Good point. Checksums it is then. Caching is also a huge issue, since ZFS isn't known for being memory-slim, and as caching is done (currently) on a per-pool level, nested pools will consume significantly more RAM. Without caching the inner pool, performance is going to suck (even if some blocks are cached in the outer pool, that pool has no way to do look-ahead, nor other actions). The nature of delayed writes can also wreak havoc with caching at both pool levels. Well, again, I don't see how a nested pool would consume more RAM than an individual pool created from dedicated disks. Read caching takes place twice, but I don't see that as much of a problem nowadays, just double the RAM (of course, depending on workload). Look-ahead (prefetch?) hasn't worked very well anyway, so it's going to be disabled; the cache hit rate isn't great (worth it) on any workload. Also, write caching needs to be benchmarked, but I'd say, if it works like it should, there are no issues there; I'll have to test it thoroughly though. Stupid filesystems have no issues with nesting, as they're not doing anything besides (essentially) direct I/O to the underlying devices. UFS doesn't have its own I/O subsystem, nor do things like ext* or xfs. However, I've yet to see any modern filesystem do well with nesting itself - there's simply too much going on under the hood, and without being nested-aware (i.e. specifically coding the filesystem to understand when it's being nested), much of these backend optimizations are a recipe for conflict. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Thanks for your thoughts; if the issues are performance related, they can be dealt with to some extent. I'm more worried about whether there are still deadlock issues or other general stability issues to consider; I haven't found anything useful in the bug database yet though. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pools inside pools
If you write to a zvol on a different host (via iSCSI) those writes use memory in a different memory pool (on the other computer). No deadlock. I would expect in a usual configuration that one side of a mirrored iSCSI-based pool would be on the same host as its underlying zvol's pool. That's what I was after. Would using a log device in the inner pool make things different then? The presumed workload is e.g. serving NFS. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver that never finishes
Hi, The drives and the chassis are fine; what I am questioning is how it can be resilvering more data to a device than the capacity of the device? If data on the pool has changed during the resilver, the resilver counter will not update accordingly, and it will show resilvering at 100% for as long as it needs to catch up. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] performance leakage when copy huge data
On Sep 9, 2010, at 8:27 AM, Fei Xu twinse...@hotmail.com wrote: This might be the dreaded WD TLER issue. Basically the drive keeps retrying a read operation over and over after a bit error, trying to recover from the read error itself. With ZFS one really needs to disable this and have the drives fail immediately. Check your drives to see if they have this feature; if so, think about replacing the drives in the source pool that have long service times, and make sure this feature is disabled on the destination pool drives. -Ross It might be due to TLER issues, but I'd try to pin the Greens down to SATA1 mode (use a jumper, or force it via the controller). It might help a bit with these disks, although they are not really suitable disks for use in any raid configuration because of the TLER issue, which cannot be disabled in later firmware versions. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedup status
Hi, it's getting better; I believe it's no longer single threaded after 135 (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6922161), but we're still waiting for a major bug fix: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6924824 It should be fixed before release AFAIK. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes
- Poweroff with USB drive connected or removed: Solaris will not boot unless the USB drive is connected, and in some cases it needs to be attached to the exact same USB port it was last attached to. Is this a bug? Possibly hitting this? http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6923585 Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] b134 pool borked!
Hi, It definitely seems like a hardware-related issue, as panics from common tools like format aren't to be expected. Anyhow, you might want to start by getting all your disks to show up in iostat / cfgadm before trying to import the pool. You should replace the controller if you have not already done so, and the RAM is all OK, I guess? Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Workaround for mpt timeouts in snv_127
... I have identified the culprit is the Western Digital drive WD2002FYPS-01U1B0. It's not clear if they can fix it in firmware, but Western Digital is replacing my drives.
Feb 17 04:45:10 thecratewall scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Feb 17 04:45:10 thecratewall scsi: [ID 365881 kern.info] /p...@0,0/pci15ad,7...@15/pci1000,3...@0 (mpt_sas0):
Feb 17 04:45:10 thecratewall Log info 0x31110630 received for target 13.
Feb 17 04:45:10 thecratewall scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Hi, do you have the disks connected as SATA1 or SATA2? With WD2003FYYS-01T8B0/WD20EADS-00S2B0/WD1001FALS-00J7B1/WD1002FBYS-01A6B0 these timeouts are to be expected if the disk is in SATA2 mode; we got rid of them after forcing the disks into SATA1 mode with jumpers, and now they only appear when a disk is having real issues and needs to be replaced. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Workaround for mpt timeouts in snv_127
No, why are they to be expected with SATA2 mode? Is the defect specific to the SATA2 circuitry? I guess it could be a temporary workaround provided they would eventually fix the problem in firmware, but I'm getting new drives, so I guess I can't complain :-) Your new disks will probably do this too. I really don't know what is wrong with their flaky SATA2, but I'd be quite sure it would fix your issues. The performance drop is not even noticeable, so it's worth a try. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?
Seems like this issue only occurs when MSI-X interrupts are enabled for the BCM5709 chips, or am I reading it wrong? If I type 'echo ::interrupts | mdb -k' and isolate the network-related bits, I get the following output:
IRQ  Vect  IPL  Bus  Trg  Type   CPU  Share  APIC/INT#  ISR(s)
36   0x60  6    PCI  Lvl  Fixed  3    1      0x1/0x4    bnx_intr_1lvl
48   0x61  6    PCI  Lvl  Fixed  2    1      0x1/0x10   bnx_intr_1lvl
Does this imply that my system is not in a vulnerable configuration? Supposedly I'm losing some performance without MSI-X, but I'm not sure in which environments or workloads we would notice, since the load on this server is relatively low and the L2ARC serves data at greater than 100MB/s (wire speed) without stressing much of anything. The BIOS settings in our T610 are exactly as they arrived from Dell when we bought it over a year ago. Thoughts? --eric Unfortunately I see interrupt type Fixed on a system that suffers from the network issues with bnx. But yes, according to the Redhat material this has something to do with Nehalem C-states (power saving etc.) and/or MSI. If your system has been running for a year or so, I wouldn't expect this issue to come up; we have noted this issue mostly with R410/R710 machines manufactured in Q4/2009-Q1/2010 (different hw revisions?). Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?
Install nexenta on a dell poweredge ? or one of these http://www.pogolinux.com/products/storage_director FYI: more recent PowerEdges (R410, R710, possibly blades too, those with integrated Broadcom chips) are not working very well with opensolaris due to Broadcom network issues: hang-ups, packet loss etc. And as opensolaris is not a supported OS, Dell is not interested in fixing these issues. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?
Our Dell T610 is and has been working just fine for the last year and a half, without a single network problem. Do you know if they're using the same integrated part? --eric Hi, as I should have mentioned, the integrated NICs that cause issues use the Broadcom BCM5709 chipset, and these connectivity issues have been quite widespread among linux people too. Redhat tries to fix this: http://kbase.redhat.com/faq/docs/DOC-26837 but I believe it's messed up in firmware somehow, as our tests show the 4.6.8-series firmware seems to be more stable. And as for workarounds, disabling MSI is bad if it creates latency for network/disk controllers, and disabling C-states on Nehalem processors is just stupid (no turbo, no power saving etc). Definitely a no-go for storage imo. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool vdev imbalance - getting worse?
This system has since been upgraded, but the imbalance is getting worse:
# zpool iostat -v tank | grep raid
raidz2  3.60T  28.5G  166  41  6.97M  764K
raidz2  3.59T  33.3G  170  35  7.35M  709K
raidz2  3.60T  26.1G  173  35  7.36M  658K
raidz2  1.69T  1.93T  129  46  6.70M  610K
raidz2  2.25T  1.38T  124  54  5.77M  967K
(columns: alloc, free, read ops, write ops, read bandwidth, write bandwidth)
Is there any way to determine how this is happening? I may have to resort to destroying and recreating some large filesystems, but there's no way to determine which ones to target... -- Ian. Hi, if you have had faulted disks in some raidsets, that would explain the imbalance, as zfs avoids writing to them while they are in a faulted state. I've encountered similar imbalance, but that was due to later changes in the pool configuration, where vdevs were added after the first ones got too full. Anyway, this is an issue, as your writes will definitely get slower once the first raidsets get more full, as mine did; writes went from 1.2GB/s to 40-50KB/s, and freeing up some space made the problem go away (total pool usage was around 60%). Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_133 mpt0 freezing machine
-Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bruno Sousa Sent: 5 March 2010 10:34 To: ZFS filesystem discussion list Subject: [zfs-discuss] snv_133 mpt0 freezing machine Hi all, Recently I got myself a new machine (Dell R710) with 1 internal Dell SAS 6/iR and 2 Sun HBAs (non-raid). From time to time this system just freezes, and I noticed that it always freezes after this message (shown in /var/adm/messages): scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@4/pci1028,1...@0 (mpt0): Does anyone have any tips on how to start tracing the problem? Best regards, Bruno I'm not sure about this issue, but I just have to say that Dell supplies the SAS 5/E controller, which is basically a Dell OEM'd LSI controller similar to the Sun HBA. These controllers seem to work well enough with the R710 (just be sure to downgrade the BIOS and NIC firmware to 1.1.4 and 4.x; more recent firmware causes network issues). Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_133 mpt0 freezing machine
-Original Message- From: Bruno Sousa [mailto:bso...@epinfante.com] Sent: 5 March 2010 13:04 To: Markus Kovero Cc: ZFS filesystem discussion list Subject: Re: [zfs-discuss] snv_133 mpt0 freezing machine Hi Markus, Thanks for your input. Regarding the broadcom fw, I already hit that issue and have downgraded it. However, for the Dell BIOS I couldn't find anything older than 1.2.6. Do you by any chance have the URL for getting BIOS 1.1.4 like you say? Bruno Hi, you can downgrade the BIOS and NIC firmware quite easily using the USC and a DVD downloadable from here (.001 and .002; use dd or copy to make an .iso out of them); http://support.dell.com/support/downloads/format.aspx?c=usl=ens=genSystemID=pwe_r710servicetag=os=WNETosl=endeviceid=16823libid=36dateid=-1typeid=-1formatid=-1catid=-1impid=-1typecnt=0vercnt=5releaseid=R236931 Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disk controllers changing the names of disks
I am curious how admins are dealing with controllers like the Dell Perc 5 and 6 that can change the device name on a disk if a disk fails and the machine reboots. These controllers are not nicely behaved in that they happily fill in the device numbers for the physical drive that is missing. In that case, how can you recover the zpool that was on the disk? I understand if the pool was exported, you can then re-import it. However, what happens if the machine completely dies and you have no chance to export the pool? -- Terry -- You can still import it, although you might lose some in-flight data that was being written during the crash, and the import can take a while to finish transactions; anyway, it will be fine. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
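A sketch of that recovery path (pool name is hypothetical): ZFS identifies pool members by their on-disk labels, so the renamed devices are found by scanning rather than by the old device paths:
# zpool import          # scan attached devices and list importable pools
# zpool import -f tank  # -f because the pool was never exported on the dead machine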
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
No one has said if they're using dsk, rdsk, or file-backed COMSTAR LUNs yet. I'm using file-backed COMSTAR LUNs, with ZIL currently disabled. I can get between 100-200MB/sec, depending on random/sequential and block sizes. Using dsk/rdsk, I was not able to see that level of performance at all. -- Brent Jones br...@servuhome.net Hi, I find COMSTAR performance very low when using zvols under dsk; somehow using them under rdsk and letting COMSTAR handle the caching makes performance really good (disks/NICs become the limiting factor). Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
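For reference, the rdsk-backed variant described above would be set up roughly like this (pool and volume names are made up; the only difference from the dsk case is the /dev/zvol/rdsk path):
# zfs create -V 100g tank/lun0
# sbdadm create-lu /dev/zvol/rdsk/tank/lun0
# stmfadm add-view 600144f0...   # use the LU GUID printed by sbdadm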
Re: [zfs-discuss] Reading ZFS config for an extended period
The other thing I've noticed with all of the "destroyed a large dataset with dedup enabled and it's taking forever to import/destroy/<insert function here>" questions is that the process runs so so so much faster with 8+ GiB of RAM. Almost to a man, everyone who reports these 3, 4, or more day destroys has less than 8 GiB of RAM on the storage server. I've witnessed destroys that take several days on systems with 24GB+ of RAM (dataset over 30TB). I guess it's just a matter of dataset size vs. how much RAM. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] I/O Read starvation
Hi, it seems you might have some kind of hardware issue there; I have no way of reproducing this. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of bank kus Sent: 10 January 2010 7:21 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] I/O Read starvation Btw FWIW if I redo the dd + 2 cp experiment on /tmp the result is far more disastrous. The GUI stops moving, caps lock stops responding for large intervals, no clue why. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Clearing a directory with more than 60 million files
Hi, while not providing a complete solution, I'd suggest turning atime off so that find/rm does not update access times, and possibly destroying unnecessary snapshots before removing the files; that should be quicker. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Mikko Lammi Sent: 5 January 2010 12:35 To: zfs-discuss@opensolaris.org Subject: [zfs-discuss] Clearing a directory with more than 60 million files Hello, As a result of one badly designed application running loose for some time, we now seem to have over 60 million files in one directory. Good thing about ZFS is that it allows it without any issues. Unfortunately, now that we need to get rid of them (because they eat 80% of disk space) it seems to be quite challenging. Traditional approaches like find ./ -exec rm {} \; seem to take forever - after running several days, the directory size still stays the same. The only way I've been able to remove anything has been by running rm -rf on the problematic directory from the parent level. Running this command shows the directory size decreasing by 10,000 files/hour, but this would still mean close to ten months (over 250 days) to delete everything! I also tried to use the unlink command on the directory as root, as the user who created the directory, and by changing the directory's owner to root and so forth, but all attempts gave a Not owner error. Any commands like ls -f or find will run for hours (or days) without actually listing anything from the directory, so I'm beginning to suspect that maybe the directory's data structure is somewhat damaged. Is there some diagnostic that I can run with e.g. zdb to investigate and hopefully fix a single directory within a zfs dataset? To make things even more difficult, this directory is located in the root fs, so dropping the zfs filesystem would basically mean reinstalling the entire system, which is something that we really wouldn't wish to do. OS is Solaris 10, zpool version is 10 (rather old, I know, but is there an easy upgrade path that might solve this problem?) and the zpool consists of two 146 GB SAS drives in a mirror setup. Any help would be appreciated. Thanks, Mikko -- Mikko Lammi | l...@lmmz.net | http://www.lmmz.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
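A sketch of the atime suggestion above (the dataset name is hypothetical; this only stops the deletes from generating extra metadata writes for access-time updates, it does not speed up the unlinks themselves):
# zfs set atime=off rpool   # or whichever dataset holds the directory
# zfs get atime rpool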
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
If the pool isn't rpool, you might want to boot into single-user mode (add -s to the kernel parameters at boot), remove /etc/zfs/zpool.cache and then reboot. After that you can simply ssh into the box and watch iostat while the pool imports. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
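Roughly, the sequence being suggested (pool name is hypothetical): without zpool.cache the pool is no longer imported automatically at boot, so the system comes up cleanly and the long import can be started and watched by hand:
# rm /etc/zfs/zpool.cache
# reboot
# ... after the box is up again ...
# zpool import tank &
# iostat -xn 5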
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
Hey Markus, Thanks for the suggestion, but as stated in the thread, I am booting using -s -kv -m verbose, and deleting the cache file was one of the first troubleshooting steps we and the others affected did. The other problem is that we were all starting an iostat at the console and ssh'ing in during multiuser mode and starting the import, but the eventual hang starts hanging iostat as well and kills the ssh. Seems like this issue is affecting more users than just me, judging from this and the other threads I've been watching. Update on the other stuff: this is day 3 of my import and still no joy. Thanks, ~Bryan Oh, my bad, I didn't go through the thread that closely. Anyway, it seems a bit odd that it's blocking I/O completely; have you tried reading from the pool's member disks with dd before the import and checking the iostat error counters for hw/transport errors? Did you try a different set of RAM or another server? Faulty RAM could do this as well. And is your swap device okay? If it happens to swap during the import of the faulty pool/device, that might cause interesting behavior as well. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
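A sketch of the two checks suggested above (the device name is made up): the dd read exercises the raw disk outside of ZFS, and iostat -En shows per-device hard/soft/transport error counters:
# dd if=/dev/rdsk/c1t2d0s0 of=/dev/null bs=1024k count=1000
# iostat -En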
Re: [zfs-discuss] ZFS write bursts cause short app stalls
Hi, try adding a flow for the traffic you want prioritized; I noticed that opensolaris tends to drop network connectivity without priority flows defined, and I believe this is a feature introduced by Crossbow itself. flowadm is your friend, that is. I found this particularly annoying if you monitor servers with icmp ping, since high load causes the checks to fail and therefore triggers unnecessary alarms. Yours Markus Kovero
-Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Saso Kiselkov Sent: 28 December 2009 15:25 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] ZFS write bursts cause short app stalls
I progressed with testing a bit further and found that I was hitting another scheduling bottleneck - the network. While the write burst was running and ZFS was committing data to disk, the server was dropping incoming UDP packets (netstat -s | grep udpInOverflows grew by about 1000-2000 packets during every write burst). To work around that I had to boost the scheduling priority of recorder processes to the real-time class, and I also had to lower zfs_txg_timeout=1 (there was still minor packet drop after just doing priocntl on the processes) to even out the CPU load. Any ideas on why ZFS should completely thrash the network layer and make it drop incoming packets? Regards, - -- Saso
Robert Milkowski wrote: On 26/12/2009 12:22, Saso Kiselkov wrote: Thank you, the post you mentioned helped me move a bit forward. I tried putting: zfs:zfs_txg_timeout = 1 btw: you can tune it on a live system without a need to do reboots.
mi...@r600:~# echo zfs_txg_timeout/D | mdb -k
zfs_txg_timeout: 30
mi...@r600:~# echo zfs_txg_timeout/W0t1 | mdb -kw
zfs_txg_timeout: 0x1e = 0x1
mi...@r600:~# echo zfs_txg_timeout/D | mdb -k
zfs_txg_timeout: 1
mi...@r600:~# echo zfs_txg_timeout/W0t30 | mdb -kw
zfs_txg_timeout: 0x1 = 0x1e
mi...@r600:~# echo zfs_txg_timeout/D | mdb -k
zfs_txg_timeout: 30
mi...@r600:~#
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
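A minimal sketch of the flowadm suggestion above (the interface name, UDP port and flow name are all made up for illustration): create a flow for the incoming UDP stream and raise its priority so it is not starved during the write bursts:
# flowadm add-flow -l e1000g0 -a transport=udp,local_port=5004 udpflow
# flowadm set-flowprop -p priority=high udpflow
# flowadm show-flow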
Re: [zfs-discuss] Troubleshooting dedup performance
Hi, I threw 24GB of RAM and a couple of the latest Nehalems at it, and dedup=on seemed to cripple performance without actually using much CPU or RAM. It's quite unusable like this. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] snv_129 dedup panic
Hi, I encountered panic and spontaneous reboot after canceling zfs send from another server. It took around 2-3hrs to remove 2TB data server had sent and then: Dec 15 16:54:05 foo ^Mpanic[cpu2]/thread=ff0916724560: Dec 15 16:54:05 foo genunix: [ID 683410 kern.notice] BAD TRAP: type=0 (#de Divide error) rp=ff003db82910 addr=ff003db82a10 Dec 15 16:54:05 foo unix: [ID 10 kern.notice] Dec 15 16:54:05 foo unix: [ID 839527 kern.notice] zpool: Dec 15 16:54:05 foo unix: [ID 753105 kern.notice] #de Divide error Dec 15 16:54:05 foo unix: [ID 358286 kern.notice] addr=0xff003db82a10 Dec 15 16:54:05 foo unix: [ID 243837 kern.notice] pid=15520, pc=0xf794310a, sp=0xff003db82a00, eflags=0x10246 Dec 15 16:54:05 foo unix: [ID 211416 kern.notice] cr0: 80050033pg,wp,ne,et,mp,pe cr4: 6f8xmme,fxsr,pge,mce,pae,pse,de Dec 15 16:54:05 foo unix: [ID 624947 kern.notice] cr2: 80a7000 Dec 15 16:54:05 foo unix: [ID 625075 kern.notice] cr3: 4721dc000 Dec 15 16:54:05 foo unix: [ID 625715 kern.notice] cr8: c Dec 15 16:54:05 foo unix: [ID 10 kern.notice] Dec 15 16:54:05 foo unix: [ID 592667 kern.notice] rdi: ff129712b578 rsi: rdx:0 Dec 15 16:54:05 foo unix: [ID 592667 kern.notice] rcx:1 r8:173724e00 r9:0 Dec 15 16:54:05 foo unix: [ID 592667 kern.notice] rax:173724e00 rbx:8 rbp: ff003db82a90 Dec 15 16:54:05 foo unix: [ID 592667 kern.notice] r10: afd231db9a85b86e r11: 3fc244aaa90 r12:0 Dec 15 16:54:05 foo unix: [ID 592667 kern.notice] r13: ff12fed0e9d0 r14: ff092953d000 r15: ff003db82a10 Dec 15 16:54:05 foo unix: [ID 592667 kern.notice] fsb:0 gsb: ff09128e9000 ds: 4b Dec 15 16:54:05 foo unix: [ID 592667 kern.notice]es: 4b fs:0 gs: 1c3 Dec 15 16:54:06 foo unix: [ID 592667 kern.notice] trp:0 err:0 rip: f794310a Dec 15 16:54:06 foo unix: [ID 592667 kern.notice]cs: 30 rfl:10246 rsp: ff003db82a00 Dec 15 16:54:06 foo unix: [ID 266532 kern.notice]ss: 38 Dec 15 16:54:06 foo unix: [ID 10 kern.notice] Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db827f0 unix:die+10f () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82900 unix:trap+1558 () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82910 unix:cmntrap+e6 () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82a90 zfs:ddt_get_dedup_object_stats+152 () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82b00 zfs:spa_config_generate+2d9 () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82b90 zfs:spa_open_common+1c2 () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82c00 zfs:spa_get_stats+50 () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82c40 zfs:zfs_ioc_pool_stats+32 () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82cc0 zfs:zfsdev_ioctl+175 () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82d00 genunix:cdev_ioctl+45 () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82d40 specfs:spec_ioctl+5a () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82dc0 genunix:fop_ioctl+7b () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82ec0 genunix:ioctl+18e () Dec 15 16:54:06 foo genunix: [ID 655072 kern.notice] ff003db82f10 unix:brand_sys_syscall32+19d () Dec 15 16:54:06 foo unix: [ID 10 kern.notice] Dec 15 16:54:06 foo genunix: [ID 672855 kern.notice] syncing file systems... 
Dec 15 16:54:06 foo genunix: [ID 904073 kern.notice] done Dec 15 16:54:07 foo genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel Dec 15 16:55:07 foo genunix: [ID 10 kern.notice] Dec 15 16:55:07 foo genunix: [ID 665016 kern.notice] ^M 64% done: 1881224 pages dumped, Dec 15 16:55:07 foo genunix: [ID 495082 kern.notice] dump failed: error 28 Is it just me or everlasting Monday again. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Space not freed?
Hi, if someone running 129 could try this out: turn off compression in your pool, mkfile 10g /pool/file123, check the used space, then remove the file and see if the used space becomes available again. I'm having trouble with this; it reminds me of a similar bug that occurred in the 111 release. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
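Spelled out, the test being asked for looks roughly like this (pool name is hypothetical; give the pool a little while after the rm, since space from deleted files is reclaimed asynchronously):
# zfs set compression=off tank
# mkfile 10g /tank/file123
# zfs list tank
# rm /tank/file123
# sync; sleep 60
# zfs list tank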
Re: [zfs-discuss] Space not freed?
Hi, if someone running 129 could try this out, turn off compression in your pool, mkfile 10g /pool/file123, see used space and then remove the file and see if it makes used space available again. I'm having trouble with this, reminds me of similar bug that occurred in 111-release. Any automatically created snapshots, perhaps? Casper Nope, no snapshots. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
How can you set up these values in FMA? Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of R.G. Keen Sent: 14 December 2009 20:14 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL FMA (not ZFS, directly) looks for a number of failures over a period of time. By default that is 10 failures in 10 minutes. If you have an error that trips on TLER, the best it can see is 2-3 failures in 10 minutes. The symptom you will see is that when these long timeouts happen, they take a long time because, by default, the drive will be reset and the I/O retried after 60 seconds. That's very good news. I'm trying to get the stuff together to set up my zfs server, and I'm also perfectly willing to trade slower operation and more disks to get zfs' scrubbing and other operations. The recent discovery that WD has decided to up its prices in a back-door manner by making sure that the DIY RAID folks can't modify TLER on cheaper drives was a real slap in the face, potentially more than doubling the price of storage. I've dealt with the MBA mentality before, and I don't like it. :-| This discovery was bad enough to almost put me off building a server entirely, with the apparent options of paying 100% more for the disks or having the array suffer 100% data loss on any significant read/write error. So let me be sure I understand. If I'm using solaris/zfs, I can use FMA to set the level of retries/time to be waited if I get a disk error before taking a disk out of the array. Is that correct? If it is, and that can be set to allow an array of disks to tolerate most instances of read/write errors without corrupting an entire array, then I'm back on with the server scheme. The whole point of going to solaris/zfs is background scrubbing for me. I'm willing for it to be slow - however slow it is, it's much faster than finding the backup DVDs in the closet, pilfering through them to find the right one, then finding out the DVD set has bit-rot too. I apologize for the baby-simple questions. I'm reading documentation as hard as I can, but there's a world of difference between reading documentation and understanding and using the tools described. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
Hi, are you sure zfs isn't just working through transactions after the forcibly stopped zfs destroy? Sometimes (always, it seems) zfs/zpool commands just hang if you destroy larger filesets; in reality zfs is just doing its job, and if you reboot the server during a dataset destroy it will take some time to come up. So how long have you waited? Have you tried removing /etc/zfs/zpool.cache, booting into snv_128, doing the import, and watching the disks with iostat to see whether there is any activity? Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jack Kielsmeier Sent: 8. joulukuuta 2009 6:08 To: zfs-discuss@opensolaris.org Subject: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled Howdy, I upgraded to snv_128a from snv_125. I wanted to do some de-dup testing :). I have two zfs pools: rpool and vault. I upgraded my vault zpool version and turned on dedup on datastore vault/shared_storage. I also turned on gzip compression on this dataset as well. Before I turned on dedup, I made a new datastore and copied all data to vault/shared_storage_temp (just in case something crazy happened to my dedup'd datastore, since dedup is new). I removed all data on my dedup'd datastore and copied all data from my temp datastore. After I realized my space savings wasn't going to be that great, I decided to delete the vault/shared_storage dataset. zfs destroy vault/shared_storage This hung, and couldn't be killed. I force rebooted my system, and I couldn't boot into Solaris. It hung at reading zfs config. I then booted into single user mode (multiple times) and any zfs or zpool commands froze. I then rebooted to my snv_125 environment. As it should, it ignored my vault zpool, as its version is higher than it can understand. I forced a zpool export of vault and rebooted; I could then boot back into snv_128 and zpool import listed the vault pool. However, I cannot import via name or identifier, the command hangs, as well as any additional zfs or zpool commands. I cannot kill or kill -9 the processes. Is there anything I can do to get my pool imported? I haven't done much troubleshooting at all on OpenSolaris; I'd be happy to run any suggested commands and provide output. Thank you for the assistance. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
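A rough sketch of the steps suggested above (this assumes you can still reach a shell, e.g. in single-user mode, and that the import may run for a long time while the interrupted destroy is replayed):

# keep the pool from being auto-imported at boot
mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak
reboot

# after booting into snv_128, start the import and leave it alone
zpool import vault

# in another terminal, confirm the disks are actually busy
iostat -xen 5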
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
From what I've noticed, if one destroys dataset that is say 50-70TB and reboots before destroy is finished, it can take up to several _days_ before it's back up again. So, nowadays I'm doing rm -fr BEFORE issuing zfs destroy whenever possible. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Michael Herf Sent: 9. joulukuuta 2009 9:38 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled Am in the same boat, exactly. Destroyed a large set and rebooted, with a scrub running on the same pool. My reboot stuck on Reading ZFS Config: * for several hours (disks were active). I cleared the zpool.cache from single-user and am doing an import (can boot again). I wasn't able to boot my 123 build (kernel panic), even though my rpool is an older version. zpool import is pegging all 4 disks in my RAIDZ-1. Can't touch zpool/zfs commands during the import or they hang...but regular iostat is ok for watching what's going on. I didn't limit ARC memory (box has 6GB), we'll see if that's ok. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
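A minimal sketch of that workaround (dataset name is a placeholder; this only helps if no snapshots still reference the data):

# remove the file data first; rm can be interrupted and resumed safely
rm -rf /tank/bigdataset/*

# the destroy itself then has far less work left to do
zfs destroy tank/bigdataset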
Re: [zfs-discuss] mpt errors on snv 127
We actually tried this, although using sol10-version of mpt-driver. Surprisingly it didn't work :-) Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Mark Johnson Sent: 1. joulukuuta 2009 15:57 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] mpt errors on snv 127 Mark Johnson wrote: Chad Cantwell wrote: Hi, I was using for quite awhile OpenSolaris 2009.06 with the opensolaris-provided mpt driver to operate a zfs raidz2 pool of about ~20T and this worked perfectly fine (no issues or device errors logged for several months, no hanging). A few days ago I decided to reinstall with the latest OpenSolaris in order to take advantage of raidz3. Just to be clear... The same setup was working fine on osol2009.06, you upgraded to b127 and it started failing? Did you keep the osol2009.06 be around so you can reboot back to it? If so, have you tried the osol2009.06 mpt driver in the BE with the latest bits (make sure you make a backup copy of the mpt driver)? What's the earliest build someone has seen this problem? i.e. if we binary chop, has anyone seen it in b118? I have no idea if the old mpt drivers will work on a new kernel... But if someone wants to try... Something like the following should work... # first, I would work out of a test BE in case you # mess something up. beadm create test-be beadm activate test-be reboot # assuming your lasted BE is call snv127, mount it and backup # the stock mpt driver and conf file. beadm mount snv127 /mnt cp /mnt/kernel/drv/mpt.conf /mnt/kernel/drv/mpt.conf.orig cp /mnt/kernel/drv/amd64/mpt /mnt/kernel/drv/amd64/mpt.orig # see what builds are out there... pkg search /kernel/drv/amd64/mpt # There's probably an easier way to do this... # grab an older mpt. This will take a while since it's # not in it's own package and ckr has some dependencies # so it will pull in a bunch of other packages. # change out 118 with the build you want to grab. mkdir /tmp/mpt pkg image-create -f -F -a opensolaris.org=http://pkg.opensolaris.org/dev /tmp/mpt pkg -R /tmp/mpt/ install sunw...@0.5.11-0.118 cp /tmp/mpt/kernel/drv/mpt.conf /mnt/kernel/drv/mpt.conf cp /tmp/mpt/kernel/drv/amd64/mpt /mnt/kernel/drv/amd64/mpt rm -rf /tmp/mpt/ bootadm update-archive -R /mnt MRJ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on JBOD storage, mpt driver issue - server not responding
Hi, you could try LSI itmpt driver as well, it seems to handle this better, although I think it only supports 8 devices at once or so. You could also try more recent version of opensolaris (123 or even 126), as there seems to be a lot fixes regarding mpt-driver (which still seems to have issues). Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of M P Sent: 11. marraskuuta 2009 18:08 To: zfs-discuss@opensolaris.org Subject: [zfs-discuss] ZFS on JBOD storage, mpt driver issue - server not responding Server using [b]Sun StorageTek 8-port external SAS PCIe HBA [/b](mpt driver) connected to external JBOD array with 12 disks. Here is link to the exact SAS (Sun) adapter: http://www.sun.com/storage/storage_networking/hba/sas/PCIe.pdf (LSI SAS3801) When running IO intensive operations (zpool scrub) for couple of hours, the server locks with the following repeating messages: Nov 10 16:31:45 sunserver scsi: [ID 365881 kern.info] /p...@0,0/pci10de,3...@a/pci1000,3...@0 (mpt0): Nov 10 16:31:45 sunserver Log info 0x3114 received for target 17. Nov 10 16:31:45 sunserver scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc Nov 10 16:32:55 sunserver scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci10de,3...@a/pci1000,3...@0 (mpt0): Nov 10 16:32:55 sunserver Disconnected command timeout for Target 19 Nov 10 16:32:56 sunserver scsi: [ID 365881 kern.info] /p...@0,0/pci10de,3...@a/pci1000,3...@0 (mpt0): Nov 10 16:32:56 sunserver Log info 0x3114 received for target 19. Nov 10 16:32:56 sunserver scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc Nov 10 16:34:16 sunserver scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci10de,3...@a/pci1000,3...@0 (mpt0): Nov 10 16:34:16 sunserver Disconnected command timeout for Target 21 I tested this on two servers: - [b]Sun Fire X2200[/b] using [b]Sun Storage J4200 JBOD[/b] array and - [b]Dell R410 Server[/b] with [b]Promise VTJ-310SS JBOD array[/b] They both are showing the same repeating messages and locking after couple of hours of zpool scrub. Solaris appears to be more stable (than OpenSolaris) - it doesn't lock when scrubbing, but still locks after 5-6 hours reading from the JBOD array - 10TB size. So at this point this looks like an issue with the MPT driver or these SAS cards (I tested two) when under heavy load. I put the latest firmware for the SAS card from LSI's web site - v1.29.00 without any changes, server still locks. Any ideas, suggestions how to fix or workaround this issue? The adapter is suppose to be enterprise-class. Here is more detailed log info: Sun Fire X2200 and Sun Storage J4200 JBOD array SAS card: Sun StorageTek 8-port external SAS PCIe HBA http://www.sun.com/storage/storage_networking/hba/sas/PCIe.pdf (LSI SAS3801) Operation System: SunOS sunserver 5.11 snv_111b i86pc i386 i86pc Solaris Nov 10 16:30:33 sunserver scsi: [ID 365881 kern.info] /p...@0,0/pci10de,3...@a/pci1000,3...@0 (mpt0): Nov 10 16:30:33 sunserver Log info 0x3114 received for target 0. 
Nov 10 16:30:33 sunserver scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc Nov 10 16:31:43 sunserver scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci10de,3...@a/pci1000,3...@0 (mpt0): Nov 10 16:31:43 sunserver Disconnected command timeout for Target 17 Nov 10 16:32:55 sunserver scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci10de,3...@a/pci1000,3...@0 (mpt0): Nov 10 16:32:55 sunserver Disconnected command timeout for Target 19 Nov 10 16:32:56 sunserver scsi: [ID 365881 kern.info] /p...@0,0/pci10de,3...@a/pci1000,3...@0 (mpt0): Nov 10 16:32:56 sunserver Log info 0x3114 received for target 19. Nov 10 16:32:56 sunserver scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc Nov 10 16:34:16 sunserver scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci10de,3...@a/pci1000,3...@0 (mpt0): Nov 10 16:34:16 sunserver Disconnected command timeout for Target 21 Dell R410 Server and Promise VTJ-310SS JBOD array SAS card: Sun StorageTek 8-port external SAS PCIe HBA Operating System: SunOS dellserver 5.10 Generic_141445-09 i86pc i386 i86pc Nov 11 00:18:22 dellserver scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@3/pci1028,1...@0 (mpt0): Nov 11 00:18:22 dellserver Disconnected command timeout for Target 0 Nov 11 00:18:22 dellserver scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@3/pci1028,1...@0/s...@0,0 (sd13): Nov 11 00:18:22 dellserver Error for Command: read(10) Error Level: Retryable Nov 11 00:18:22 dellserver scsi: [ID 107833 kern.notice] Requested Block: 276886498 Error Block: 276886498 Nov 11 00:18:22 dellserver scsi: [ID 107833 kern.notice] Vendor: Dell Serial Number: Dell Interna
Re: [zfs-discuss] ZFS on JBOD storage, mpt driver issue - server not responding
Have you tried another SAS-cable? Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of M P Sent: 11. marraskuuta 2009 21:05 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] ZFS on JBOD storage, mpt driver issue - server not responding I already changed some of the drives, no difference. The target drive seem to have random character - most likely not from the drives. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SNV_125 MPT warning in logfile
How do you estimate needed queue depth if one has say 64 to 128 disks sitting behind LSI? Is it bad idea having queuedepth 1? Yours Markus Kovero Lähettäjä: zfs-discuss-boun...@opensolaris.org [zfs-discuss-boun...@opensolaris.org] k#228;ytt#228;j#228;n Richard Elling [richard.ell...@gmail.com] puolesta Lähetetty: 24. lokakuuta 2009 7:36 Vastaanottaja: Adam Cheal Kopio: zfs-discuss@opensolaris.org Aihe: Re: [zfs-discuss] SNV_125 MPT warning in logfile ok, see below... On Oct 23, 2009, at 8:14 PM, Adam Cheal wrote: Here is example of the pool config we use: # zpool status pool: pool002 state: ONLINE scrub: scrub stopped after 0h1m with 0 errors on Fri Oct 23 23:07:52 2009 config: NAME STATE READ WRITE CKSUM pool002 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c9t18d0 ONLINE 0 0 0 c9t17d0 ONLINE 0 0 0 c9t55d0 ONLINE 0 0 0 c9t13d0 ONLINE 0 0 0 c9t15d0 ONLINE 0 0 0 c9t16d0 ONLINE 0 0 0 c9t11d0 ONLINE 0 0 0 c9t12d0 ONLINE 0 0 0 c9t14d0 ONLINE 0 0 0 c9t9d0 ONLINE 0 0 0 c9t8d0 ONLINE 0 0 0 c9t10d0 ONLINE 0 0 0 c9t29d0 ONLINE 0 0 0 c9t28d0 ONLINE 0 0 0 c9t27d0 ONLINE 0 0 0 c9t23d0 ONLINE 0 0 0 c9t25d0 ONLINE 0 0 0 c9t26d0 ONLINE 0 0 0 c9t21d0 ONLINE 0 0 0 c9t22d0 ONLINE 0 0 0 c9t24d0 ONLINE 0 0 0 c9t19d0 ONLINE 0 0 0 raidz2 ONLINE 0 0 0 c9t30d0 ONLINE 0 0 0 c9t31d0 ONLINE 0 0 0 c9t32d0 ONLINE 0 0 0 c9t33d0 ONLINE 0 0 0 c9t34d0 ONLINE 0 0 0 c9t35d0 ONLINE 0 0 0 c9t36d0 ONLINE 0 0 0 c9t37d0 ONLINE 0 0 0 c9t38d0 ONLINE 0 0 0 c9t39d0 ONLINE 0 0 0 c9t40d0 ONLINE 0 0 0 c9t41d0 ONLINE 0 0 0 c9t42d0 ONLINE 0 0 0 c9t44d0 ONLINE 0 0 0 c9t45d0 ONLINE 0 0 0 c9t46d0 ONLINE 0 0 0 c9t47d0 ONLINE 0 0 0 c9t48d0 ONLINE 0 0 0 c9t49d0 ONLINE 0 0 0 c9t50d0 ONLINE 0 0 0 c9t51d0 ONLINE 0 0 0 c9t52d0 ONLINE 0 0 0 cache c8t2d0 ONLINE 0 0 0 c8t3d0 ONLINE 0 0 0 spares c9t20d0AVAIL c9t43d0AVAIL errors: No known data errors pool: rpool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 mirror ONLINE 0 0 0 c8t0d0s0 ONLINE 0 0 0 c8t1d0s0 ONLINE 0 0 0 errors: No known data errors ...and here is a snapshot of the system using iostat -indexC 5 during a scrub of pool002 (c8 is onboard AHCI controller, c9 is LSI SAS 3801E): extended device statistics errors --- r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.00.00.00.0 0.0 0.00.00.0 0 0 0 0 0 0 c8 0.00.00.00.0 0.0 0.00.00.0 0 0 0 0 0 0 c8t0d0 0.00.00.00.0 0.0 0.00.00.0 0 0 0 0 0 0 c8t1d0 0.00.00.00.0 0.0 0.00.00.0 0 0 0 0 0 0 c8t2d0 0.00.00.00.0 0.0 0.00.00.0 0 0 0 0 0 0 c8t3d0 8738.70.0 555346.10.0 0.1 345.00.0 39.5 0 3875 0 1 1 2 c9 You see 345 entries in the active queue. If the controller rolls over at 511 active entries, then it would explain why it would soon begin to have difficulty. Meanwhile, it is providing 8,738 IOPS and 555 MB/sec, which is quite respectable. 194.80.0 11936.90.0 0.0 7.90.0 40.3 0 87 0 0 0 0 c9t8d0 These disks are doing almost 200 read IOPS, but are not 100% busy. Average I/O size is 66 KB, which is not bad, lots of little I/Os could be worse, but at only 11.9 MB/s, you are not near the media bandwidth. Average service time is 40.3 milliseconds, which
Re: [zfs-discuss] SNV_125 MPT warning in logfile
We actually hit similar issues with LSI, but within workload not scrub, result is same but it seems to choke on writes rather than reads with suboptimal performance. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6891413 Anyway, we haven't experienced this _at all_ with RE3-version of Western Digital disks.. Issues seem to pop up with 750GB seagate and 1TB WD black-series, so far 2TB green WDs seem unaffected too, so might it be related to disks firmware due how they chat with LSI? Also, we noticed more severe (even RE3 and 2TBWD green) timeouts if disks are not forced into SATA1-mode, I believe this is known issue with newer 2TB disks and some other disk controllers and may be caused by bad cabling or connectivity. We have never witnessed this behaviour with SAS (fujitsu,ibm..) also. All this happens with snv 118,122,123 and 125. Yours Markus Kovero Lähettäjä: zfs-discuss-boun...@opensolaris.org [zfs-discuss-boun...@opensolaris.org] k#228;ytt#228;j#228;n Adam Cheal [ach...@pnimedia.com] puolesta Lähetetty: 24. lokakuuta 2009 12:49 Vastaanottaja: zfs-discuss@opensolaris.org Aihe: Re: [zfs-discuss] SNV_125 MPT warning in logfile The iostat I posted previously was from a system we had already tuned the zfs:zfs_vdev_max_pending depth down to 10 (as visible by the max of about 10 in actv per disk). I reset this value in /etc/system to 7, rebooted, and started a scrub. iostat output showed busier disks (%b is higher, which seemed odd) but a cap of about 7 queue items per disk, proving the tuning was effective. iostat at a high-water mark during the test looked like this: extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.00.00.00.0 0.0 0.00.00.0 0 0 c8 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t0d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t1d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t2d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c8t3d0 8344.50.0 359640.40.0 0.1 300.50.0 36.0 0 4362 c9 190.00.0 6800.40.0 0.0 6.60.0 34.8 0 99 c9t8d0 185.00.0 6917.10.0 0.0 6.10.0 32.9 0 94 c9t9d0 187.00.0 6640.90.0 0.0 6.50.0 34.6 0 98 c9t10d0 186.50.0 6543.40.0 0.0 7.00.0 37.5 0 100 c9t11d0 180.50.0 7203.10.0 0.0 6.70.0 37.2 0 100 c9t12d0 195.50.0 7352.40.0 0.0 7.00.0 35.8 0 100 c9t13d0 188.00.0 6884.90.0 0.0 6.60.0 35.2 0 99 c9t14d0 204.00.0 6990.10.0 0.0 7.00.0 34.3 0 100 c9t15d0 199.00.0 7336.70.0 0.0 7.00.0 35.2 0 100 c9t16d0 180.50.0 6837.90.0 0.0 7.00.0 38.8 0 100 c9t17d0 198.00.0 7668.90.0 0.0 7.00.0 35.3 0 100 c9t18d0 203.00.0 7983.20.0 0.0 7.00.0 34.5 0 100 c9t19d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c9t20d0 195.50.0 7096.40.0 0.0 6.70.0 34.1 0 98 c9t21d0 189.50.0 7757.20.0 0.0 6.40.0 33.9 0 97 c9t22d0 195.50.0 7645.90.0 0.0 6.60.0 33.8 0 99 c9t23d0 194.50.0 7925.90.0 0.0 7.00.0 36.0 0 100 c9t24d0 188.50.0 6725.60.0 0.0 6.20.0 32.8 0 94 c9t25d0 188.50.0 7199.60.0 0.0 6.50.0 34.6 0 98 c9t26d0 196.00.0 .90.0 0.0 6.30.0 32.1 0 95 c9t27d0 193.50.0 7455.40.0 0.0 6.20.0 32.0 0 95 c9t28d0 189.00.0 7400.90.0 0.0 6.30.0 33.2 0 96 c9t29d0 182.50.0 9397.00.0 0.0 7.00.0 38.3 0 100 c9t30d0 192.50.0 9179.50.0 0.0 7.00.0 36.3 0 100 c9t31d0 189.50.0 9431.80.0 0.0 7.00.0 36.9 0 100 c9t32d0 187.50.0 9082.00.0 0.0 7.00.0 37.3 0 100 c9t33d0 188.50.0 9368.80.0 0.0 7.00.0 37.1 0 100 c9t34d0 180.50.0 9332.80.0 0.0 7.00.0 38.8 0 100 c9t35d0 183.00.0 9690.30.0 0.0 7.00.0 38.2 0 100 c9t36d0 186.00.0 9193.80.0 0.0 7.00.0 37.6 0 100 c9t37d0 180.50.0 8233.40.0 0.0 7.00.0 38.8 0 100 c9t38d0 175.50.0 9085.20.0 0.0 7.00.0 39.9 0 100 c9t39d0 177.00.0 9340.00.0 0.0 7.00.0 39.5 0 100 c9t40d0 175.50.0 8831.00.0 0.0 7.00.0 39.9 0 100 
c9t41d0 190.50.0 9177.80.0 0.0 7.00.0 36.7 0 100 c9t42d0 0.00.00.00.0 0.0 0.00.00.0 0 0 c9t43d0 196.00.0 9180.50.0 0.0 7.00.0 35.7 0 100 c9t44d0 193.50.0 9496.80.0 0.0 7.00.0 36.2 0 100 c9t45d0 187.00.0 8699.50.0 0.0 7.00.0 37.4 0 100 c9t46d0 198.50.0 9277.00.0 0.0 7.00.0 35.2 0 100 c9t47d0 185.50.0 9778.30.0 0.0 7.00.0
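For reference, a minimal sketch of how the zfs_vdev_max_pending tunable discussed above is typically set (7 is just the value used in this thread; the right number is workload- and controller-dependent):

# persistent, in /etc/system (takes effect at the next boot)
set zfs:zfs_vdev_max_pending = 7

# or live on a running system, without a reboot
echo "zfs_vdev_max_pending/W0t7" | mdb -kw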
[zfs-discuss] Numbered vdevs
Hi, I just noticed this on snv_125: is there an upcoming feature that allows the use of numbered vdevs, or what are these for? (raidz2-N)

  pool: tank
 state: ONLINE
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            c8t40d0   ONLINE       0     0     0
            c8t36d0   ONLINE       0     0     0
            c8t38d0   ONLINE       0     0     0
            c8t39d0   ONLINE       0     0     0
            c8t41d0   ONLINE       0     0     0
            c8t42d0   ONLINE       0     0     0
            c8t43d0   ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
            c8t44d0   ONLINE       0     0     0
            c8t45d0   ONLINE       0     0     0
            c8t46d0   ONLINE       0     0     0
            c8t47d0   ONLINE       0     0     0
            c8t48d0   ONLINE       0     0     0
            c8t49d0   ONLINE       0     0     0
            c8t50d0   ONLINE       0     0     0
          raidz2-2    ONLINE       0     0     0
            c8t51d0   ONLINE       0     0     0
            c8t86d0   ONLINE       0     0     0
            c8t87d0   ONLINE       0     0     0
            c8t149d0  ONLINE       0     0     0
            c8t91d0   ONLINE       0     0     0
            c8t94d0   ONLINE       0     0     0
            c8t95d0   ONLINE       0     0     0

Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Unusual latency issues
Hi, this may not be the correct mailing list for this, but I'd like to share it with you: I noticed weird network behavior with osol snv_123. ICMP to the host lags randomly between 500 ms and 5000 ms and ssh sessions seem to stall; I guess this could affect iSCSI/NFS as well. What was most interesting is that I found a workaround: running snoop with promiscuous mode disabled on the interfaces suffering the lag makes the interruptions go away. Is this some kind of CPU/IRQ scheduling issue? The behaviour was noticed on two different platforms and with two different NICs (bge and e1000). Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
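For reference, the workaround described above amounts to something like this (interface name is a placeholder; -P keeps snoop out of promiscuous mode):

# leave running in the background and discard the capture; only the side effect matters
snoop -P -d e1000g0 > /dev/null &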
[zfs-discuss] Migrate from iscsitgt to comstar?
Is it possible to migrate data from iscsitgt to a COMSTAR iSCSI target? I guess COMSTAR wants its metadata at the beginning of the volume, and this makes things difficult? Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
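For reference, a hedged sketch of what exposing an existing zvol through COMSTAR looks like (pool/volume names and the LU GUID are placeholders, and the service name is the one used by builds with the COMSTAR iSCSI target). The metadata concern raised above applies to sbdadm create-lu, so test against a copy of the data rather than the live iscsitgt volume:

svcadm enable stmf
sbdadm create-lu /dev/zvol/rdsk/tank/iscsivol       # prints the LU GUID
stmfadm add-view <lu-guid>                          # expose to all hosts (or restrict with host groups)
svcadm enable -r svc:/network/iscsi/target:default
itadm create-target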
Re: [zfs-discuss] sync replication easy way?
Hi, I managed to test this out, it seems iscsitgt performance is suboptimal with this setup but somehow comstar maxes out gige easily, no performance issues there. Yours Markus Kovero -Original Message- From: Maurice Volaski [mailto:maurice.vola...@einstein.yu.edu] Sent: 11. syyskuuta 2009 20:40 To: Markus Kovero; zfs-discuss@opensolaris.org Subject: RE: [zfs-discuss] sync replication easy way? At 8:25 PM +0300 9/11/09, Markus Kovero wrote: I believe failover is best to be done manually just to be sure active node is really dead before importing it on another node, otherwise there could be serious issues I think. I believe there are many users of Linux-HA, aka heartbeat, who do failover automatically on Linux systems. You can configure a stonith device to shoot the other node in the head. I had heartbeat running on OpenSolaris, though I never tested failover. Did you get decent performance when you tested? -- Maurice Volaski, maurice.vola...@einstein.yu.edu Computing Support, Rose F. Kennedy Center Albert Einstein College of Medicine of Yeshiva University ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAIDZ versus mirrroed
It's possible to do 3-way (or more) mirrors too, so you may achieve better redundancy than raidz2/3 Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Marty Scholes Sent: 16. syyskuuta 2009 19:38 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] RAIDZ versus mirrroed Generally speaking, striping mirrors will be faster than raidz or raidz2, but it will require a higher number of disks and therefore higher cost to The main reason to use raidz or raidz2 instead of striping mirrors would be to keep the cost down, or to get higher usable space out of a fixed number of drives. While it has been a while since I have done storage management for critical systems, the advantage I see with RAIDZN is better fault tolerance: any N drives may fail before the set goes critical. With straight mirroring, failure of the wrong two drives will invalidate the whole pool. The advantage of striped mirrors is that it offers a better chance of higher iops (assuming the I/O is distributed correctly). Also, it might be easier to expand a mirror by upgrading only two drives with larger drives. With RAID, the entire stripe of drives would need to be upgraded. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
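For reference, a quick sketch of both ways to end up with a 3-way mirror (device names are placeholders):

# build the pool as 3-way mirrors from the start
zpool create tank mirror c0t0d0 c0t1d0 c0t2d0

# or attach a third disk to an existing 2-way mirror
zpool attach tank c0t0d0 c0t2d0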
Re: [zfs-discuss] alternative hardware configurations for zfs
We've been using caviar black 1TB with disk configurations consisting 64 disks or more. They are working just fine. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Eugen Leitl Sent: 11. syyskuuta 2009 9:51 To: Eric Sproul; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] alternative hardware configurations for zfs On Thu, Sep 10, 2009 at 01:11:49PM -0400, Eric Sproul wrote: I would not use the Caviar Black drives, regardless of TLER settings. The RE3 or RE4 drives would be a better choice, since they also have better vibration tolerance. This will be a significant factor in a chassis with 20 spinning drives. Yes, I'm aware of the issue, and am using 16x RE4 drives in my current box right now (which I unfortunately had to convert to CentOS 5.3 for Oracle/ custom software compatibility reasons). I've made very bad experiences with Seagate 7200.11 in RAID in the past. Thanks for your advice against Caviar Black. Do you think above is a sensible choice? All your other choices seem good. I've used a lot of Supermicro gear with good results. The very leading-edge hardware is sometimes not supported, but I've been using http://www.supermicro.com/products/motherboard/QPI/5500/X8DAi.cfm in above box. anything that's been out for a while should work fine. I presume you're going for an Intel Xeon solution-- the peripherals on those boards a a bit better supported than the AMD stuff, but even the AMD boards work well. Yes, dual-socket quadcore Xeon. -- Eugen* Leitl a href=http://leitl.org;leitl/a http://leitl.org __ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] alternative hardware configurations for zfs
Couple months, nope. I guess there is this DOS utility provided by WD that allows you change TLER settings having TLER disabled can be problem, faulty disks timeout randomly and zfs doesn't always want to mark them as failed, sometimes it does though. Yours Markus Kovero -Original Message- From: Tristan Ball [mailto:tristan.b...@leica-microsystems.com] Sent: 11. syyskuuta 2009 10:04 To: Markus Kovero; zfs-discuss@opensolaris.org Subject: RE: [zfs-discuss] alternative hardware configurations for zfs How long have you had them in production? Were you able to adjust the TLER settings from within solaris? Thanks, Tristan. -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Markus Kovero Sent: Friday, 11 September 2009 5:00 PM To: Eugen Leitl; Eric Sproul; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] alternative hardware configurations for zfs We've been using caviar black 1TB with disk configurations consisting 64 disks or more. They are working just fine. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Eugen Leitl Sent: 11. syyskuuta 2009 9:51 To: Eric Sproul; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] alternative hardware configurations for zfs On Thu, Sep 10, 2009 at 01:11:49PM -0400, Eric Sproul wrote: I would not use the Caviar Black drives, regardless of TLER settings. The RE3 or RE4 drives would be a better choice, since they also have better vibration tolerance. This will be a significant factor in a chassis with 20 spinning drives. Yes, I'm aware of the issue, and am using 16x RE4 drives in my current box right now (which I unfortunately had to convert to CentOS 5.3 for Oracle/ custom software compatibility reasons). I've made very bad experiences with Seagate 7200.11 in RAID in the past. Thanks for your advice against Caviar Black. Do you think above is a sensible choice? All your other choices seem good. I've used a lot of Supermicro gear with good results. The very leading-edge hardware is sometimes not supported, but I've been using http://www.supermicro.com/products/motherboard/QPI/5500/X8DAi.cfm in above box. anything that's been out for a while should work fine. I presume you're going for an Intel Xeon solution-- the peripherals on those boards a a bit better supported than the AMD stuff, but even the AMD boards work well. Yes, dual-socket quadcore Xeon. -- Eugen* Leitl a href=http://leitl.org;leitl/a http://leitl.org __ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss __ This email has been scanned by the MessageLabs Email Security System. For more information please visit http://www.messagelabs.com/email __ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] sync replication easy way?
Hi, I was just wondering about the following idea; I guess somebody mentioned something similar and I'd like some thoughts on this.

1. create an iscsi volume on Node-A and mount it locally with iscsiadm
2. create a pool with this local iscsi share
3. create an iscsi volume on Node-B and share it to Node-A
4. create a mirror from both disks on Node-A; zpool attach foopool localiscsivolume remotevolume

Why not? After a quick test it seems to fail and resilver like it should when nodes fail. Actual failover needs to be done manually though, but am I missing something relevant here? Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
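A rough command sketch of those steps, assuming the legacy iscsitgt/shareiscsi stack described above (pool and volume names, sizes and the Node-B address are placeholders):

# Node-A: backing volume, shared over iSCSI and mounted back locally
zfs create -V 100g tank/locallun
zfs set shareiscsi=on tank/locallun
iscsiadm add discovery-address 127.0.0.1
iscsiadm modify discovery --sendtargets enable

# Node-B: matching volume, shared to Node-A
zfs create -V 100g tank/remotelun
zfs set shareiscsi=on tank/remotelun

# Node-A: discover Node-B's target, then build the pool as a mirror
iscsiadm add discovery-address <node-b-ip>
zpool create foopool <local-iscsi-disk>
zpool attach foopool <local-iscsi-disk> <remote-iscsi-disk>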
Re: [zfs-discuss] sync replication easy way?
This also makes failover more easy, as volumes are already shared via iscsi on both nodes. I have to poke it next week to see performance numbers, I could imagine it plays within expected iscsi performance, or it should atleast. Yours Markus Kovero -Original Message- From: Richard Elling [mailto:richard.ell...@gmail.com] Sent: 11. syyskuuta 2009 19:53 To: Markus Kovero Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] sync replication easy way? On Sep 11, 2009, at 5:05 AM, Markus Kovero wrote: Hi, I was just wondering following idea, I guess somebody mentioned something similar and I'd like some thoughts on this. 1. create iscsi volume on Node-A and mount it locally with iscsiadm 2. create pool with this local iscsi-share 3. create iscsi volume on Node-B and share it to Node-A 4. create mirror from both disks on Node-A; zpool attach foopool localiscsivolume remotevolume Why not? After quick test it seems to fail and resilver like it should when nodes fail. Actual failover needs to be done manually though, but am I missing something relevant here? This is more complicated than the more commonly used, simpler method: 1. create iscsi volume on Node-B, share to Node-A 2. zpool create mypool mirror local-vdev iscsi-vdev -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] sync replication easy way?
I believe failover is best to be done manually just to be sure active node is really dead before importing it on another node, otherwise there could be serious issues I think. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Maurice Volaski Sent: 11. syyskuuta 2009 19:24 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] sync replication easy way? This method also allows one to nest mirroring or some RAID-z level with mirroring. When I tested it with a older build a while back, I found performance really poor, about 1-2 MB/second, but my environment was also constrained. A major showstopper had been the infamous 3 minute iSCSI timeout, which was recently fixed, http://bugs.opensolaris.org/view_bug.do?bug_id=649. How is your performance? Also, why do you think failover has to be done manually? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] This is the scrub that never ends...
Hi, I noticed that counters will not get updated if data amount increases during scrub/resilver, so if application has written new data during scrub, counter will not give realistic estimate. This happens with resilvering and scrub, somebody could fix this? Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Will Murnane Sent: 7. syyskuuta 2009 16:42 To: ZFS Mailing List Subject: [zfs-discuss] This is the scrub that never ends... I have a pool composed of a single raidz2 vdev, which is currently degraded (missing a disk): config: NAME STATE READ WRITE CKSUM pool DEGRADED 0 0 0 raidz2 DEGRADED 0 0 0 c8d1 ONLINE 0 0 0 c8d0 ONLINE 0 0 0 c12t4d0 ONLINE 0 0 0 c12t3d0 ONLINE 0 0 0 c12t2d0 ONLINE 0 0 0 c12t0d0 OFFLINE 0 0 0 logs c10d0 ONLINE 0 0 0 errors: No known data errors I have it scheduled for periodic scrubs, via root's crontab: 20 2 1 * * /usr/sbin/zpool scrub pool but this scrub was kicked off manually. Last night I checked its status and saw: scrub: scrub in progress for 20h32m, 100.00% done, 0h0m to go This morning I see: scrub: scrub in progress for 31h10m, 100.00% done, 0h0m to go It's 100% done, but yet hasn't finished in 10 hours! zpool iostat -v pool 10 shows it's doing between 50 and 120 MB/s of reads, when the userspace applications are only doing a few megabytes per second of I/O, as measured by the DTraceToolkit script rwtop (app_r: 4469 KB, app_w: 4579 KB). What can cause this kind of behavior, and how can I make my pool finish scrubbing? Will ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
Please check iostat -xen to see whether there are transport or hardware errors generated by, say, device timeouts or bad cables. Consumer disks usually just time out from time to time under load, whereas the RE versions usually report an error. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Simon Breden Sent: 2. syyskuuta 2009 17:34 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool I too see checksum errors occurring for the first time using OpenSolaris 2009.06 on the /dev package repository at version snv_121. I see the problem occur within a mirrored boot pool (rpool) using SSDs. Hardware is AMD BE-2350 (ECC) processor with 4GB ECC memory on MCP55 chipset, although SATA is using the mpt driver on a SuperMicro AOC-USAS-L8i controller card. More here: http://breden.org.uk/2009/09/02/home-fileserver-handling-pool-errors/ So I'm going to check my other boot environments to see if a rollback makes sense (< snv_121). Cheers, Simon -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
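For example (the trailing s/w, h/w, trn and tot columns of iostat -xen are the soft, hard, transport and total error counters to watch):

iostat -xen 5     # per-device I/O statistics plus error counters, 5-second samples
iostat -En        # per-device error summary with vendor/serial details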
[zfs-discuss] possible resilver bugs
Hi, I don't have the means to replicate this issue nor to file a bug about it, so I'd like your opinion about these issues, or perhaps someone can make a bug report if necessary. In a scenario with, say, three raidz2 groups consisting of several disks each, two disks fail in different raidz groups. You have a degraded pool and two degraded raidz2 groups. Now, one replaces the first disk and starts resilvering; it takes a day, two days, three days - the counter says 100% resilvered but new data is still being written to the disk being replaced. The counter SHOULD update if the amount of data in the group increases. Before that first disk is resilvered, the second failed disk in the second group is replaced, resulting in BOTH resilver processes starting from the beginning, making the pool rather unusable due to the two resilvers and leaving the pool compromised for several days to come. Replacing a disk in the other raidz2 group should not interfere with the ongoing resilvering of another disk set. Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [n/zfs-discuss] Strange speeds with x4500, Solaris 10 10/08
btw, there's coming new Intel X25-M (G2) next month that will offer better random read/writes than E-series and seriously cheap pricetag, worth for a try I'd say. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jorgen Lundman Sent: 30. heinäkuuta 2009 9:55 To: ZFS Discussions Subject: Re: [zfs-discuss] [n/zfs-discuss] Strange speeds with x4500, Solaris 10 10/08 Bob Friesenhahn wrote: Something to be aware of is that not all SSDs are the same. In fact, some faster SSDs may use a RAM write cache (they all do) and then ignore a cache sync request while not including hardware/firmware support to ensure that the data is persisted if there is power loss. Perhaps your fast CF device does that. If so, that would be really bad for zfs if your server was to spontaneously reboot or lose power. This is why you really want a true enterprise-capable SSD device for your slog. Naturally, we just wanted to try the various technologies to see how they compared. Store-bought CF card took 26s, store-bought SSD 48s. We have not found a PCI NVRam card yet. When talking to our Sun vendor, they have no solutions, which is annoying. X25-E would be good, but some pools have no spares, and since you can't remove vdevs, we'd have to move all customers off the x4500 before we can use it. CF card need reboot to see the cards, but 6 servers are x4500, not x4540, so not really a global solution. PCI NVRam cards need a reboot, but should work in both x4500 and x4540 without zpool rebuilding. But can't actually find any with Solaris drivers. Peculiar. Lund -- Jorgen Lundman | lund...@lundman.net Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hungs up forever...
I recently noticed that importing larger pools that are occupied by large amounts of data can do zpool import for several hours while zpool iostat only showing some random reads now and then and iostat -xen showing quite busy disk usage, It's almost it goes thru every bit in pool before it goes thru. Somebody said that zpool import got faster on snv118, but I don't have real information on that yet. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Victor Latushkin Sent: 29. heinäkuuta 2009 14:05 To: Pavel Kovalenko Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] zpool import hungs up forever... On 29.07.09 14:42, Pavel Kovalenko wrote: fortunately, after several hours terminal went back -- # zdb -e data1 Uberblock magic = 00bab10c version = 6 txg = 2682808 guid_sum = 14250651627001887594 timestamp = 1247866318 UTC = Sat Jul 18 01:31:58 2009 Dataset mos [META], ID 0, cr_txg 4, 27.1M, 3050 objects Dataset data1 [ZPL], ID 5, cr_txg 4, 5.74T, 52987 objects capacity operations bandwidth errors descriptionused avail read write read write read write cksum data1 5.74T 6.99T 772 0 96.0M 0 0 0 91 /dev/dsk/c14t0d05.74T 6.99T 772 0 96.0M 0 0 0 223 # So we know that there are some checksum errors there but at least zdb was able to open pool in read-only mode. i've tried to run zdb -e -t 2682807 data1 and #echo 0t::pid2proc|::walk thread|::findstack -v | mdb -k This is wrong - you need to put PID of the 'zpool import data1' process right after '0t'. and #fmdump -eV shows checksum errors, such as Jul 28 2009 11:17:35.386268381 ereport.fs.zfs.checksum nvlist version: 0 class = ereport.fs.zfs.checksum ena = 0x1baa23c52ce01c01 detector = (embedded nvlist) nvlist version: 0 version = 0x0 scheme = zfs pool = 0x578154df5f3260c0 vdev = 0x6e4327476e17daaa (end detector) pool = data1 pool_guid = 0x578154df5f3260c0 pool_context = 2 pool_failmode = wait vdev_guid = 0x6e4327476e17daaa vdev_type = disk vdev_path = /dev/dsk/c14t0d0p0 vdev_devid = id1,s...@n2661000612646364/q parent_guid = 0x578154df5f3260c0 parent_type = root zio_err = 50 zio_offset = 0x2313d58000 zio_size = 0x4000 zio_objset = 0x0 zio_object = 0xc zio_level = 0 zio_blkid = 0x0 __ttl = 0x1 __tod = 0x4a6ea60f 0x1705fcdd This tells us that object 0xc in metabjset (objset 0x0) is corrupted. So to get more details you can do the following: zdb -e - data1 zdb -e -bbcs data1 victor ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs destroy slow?
Hi, how come zfs destroy is so slow? E.g. destroying a 6TB dataset renders the zfs admin commands useless for the time being, in this case for hours. (running osol 111b with latest patches.) Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs destroy slow?
Oh well, the whole system seems to be deadlocked. Nice. A little too keen on keeping data safe :-P Yours Markus Kovero From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Markus Kovero Sent: 27. heinäkuuta 2009 13:39 To: zfs-discuss@opensolaris.org Subject: [zfs-discuss] zfs destroy slow? Hi, how come zfs destroy is so slow? E.g. destroying a 6TB dataset renders the zfs admin commands useless for the time being, in this case for hours. (running osol 111b with latest patches.) Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] No files but pool is full?
During our tests we noticed very disturbing behavior; what could be causing this? The system is running the latest stable OpenSolaris. Are there any other means of removing these ghost files than destroying the pool and restoring from backups?

r...@~# zpool status testpool
  pool: testpool
 state: ONLINE
 scrub: scrub completed after 0h26m with 0 errors on Fri Jul 24 10:32:09 2009
config:

        NAME                       STATE     READ WRITE CKSUM
        testpool                   ONLINE       0     0     0
          raidz2                   ONLINE       0     0     0
            c0t5000C5000505C31Bd0  ONLINE       0     0     0
            c0t5000C5000498A9D3d0  ONLINE       0     0     0
            c0t5000C5000505B523d0  ONLINE       0     0     0
            c0t5000C5000505BB83d0  ONLINE       0     0     0
            c0t5000C5000505B727d0  ONLINE       0     0     0
            c0t5000C50004987B6Bd0  ONLINE       0     0     0

errors: No known data errors

r...@~# zpool list testpool
NAME       SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
testpool   408G   402G  6.37G   98%  ONLINE  -

r...@~# ls -lasht /testpool/
total 4.0K
1.5K drwxr-xr-x 29 root root 30 2009-07-24 09:56 ..
2.5K drwxr-xr-x  2 root root  2 2009-07-23 18:23 .

r...@~# df /testpool
Filesystem   1K-blocks       Used  Available Use% Mounted on
testpool     280481377  280481377          0 100% /testpool

r...@~# df -i /testpool
Filesystem     Inodes  IUsed  IFree IUse% Mounted on
testpool            7      7      0  100% /testpool

r...@~# zdb - testpool
...
    Object  lvl   iblk   dblk  lsize  asize  type
         6    5    16K   128K  1000G   262G  ZFS plain file
                                 264  bonus  ZFS znode
        path    ???<object#6>
        uid     0
        gid     0
        atime   Thu Jul 23 17:23:19 2009
        mtime   Thu Jul 23 17:50:17 2009
        ctime   Thu Jul 23 17:50:17 2009
        crtime  Thu Jul 23 17:23:19 2009
        gen     19
        mode    100600
        size    1073741824000
        parent  3
        links   0
        xattr   0
        rdev    0x

Yours Markus Kovero ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] No files but pool is full?
r...@~# zfs list -t snapshot NAME USED AVAIL REFER MOUNTPOINT rpool/ROOT/opensola...@install 146M - 2.82G - r...@~# -Original Message- From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of Mattias Pantzare Sent: 24. heinäkuuta 2009 10:56 To: Markus Kovero Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] No files but pool is full? On Fri, Jul 24, 2009 at 09:33, Markus Koveromarkus.kov...@nebula.fi wrote: During our tests we noticed very disturbing behavior, what would be causing this? System is running latest stable opensolaris. Any other means to remove ghost files rather than destroying pool and restoring from backups? You may have snapshots, try: zfs list -t snapshot ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] No files but pool is full?
Yes, the server has been rebooted several times and there is no available space. Is it possible to somehow delete the ghosts that zdb sees? How can this happen? Yours Markus Kovero -Original Message- From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of Mattias Pantzare Sent: 24. heinäkuuta 2009 11:22 To: Markus Kovero Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] No files but pool is full? On Fri, Jul 24, 2009 at 09:57, Markus Kovero markus.kov...@nebula.fi wrote: r...@~# zfs list -t snapshot NAME USED AVAIL REFER MOUNTPOINT rpool/ROOT/opensola...@install 146M - 2.82G - r...@~# Then it is probably some process that has a deleted file open. You can find those with: fuser -c /testpool But if you can't find the space after a reboot something is not right... -Original Message- From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of Mattias Pantzare Sent: 24. heinäkuuta 2009 10:56 To: Markus Kovero Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] No files but pool is full? On Fri, Jul 24, 2009 at 09:33, Markus Kovero markus.kov...@nebula.fi wrote: During our tests we noticed very disturbing behavior, what would be causing this? System is running latest stable opensolaris. Any other means to remove ghost files rather than destroying pool and restoring from backups? You may have snapshots, try: zfs list -t snapshot ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] No files but pool is full?
Hi, thanks for pointing out issue, we haven't run updates on server yet. Yours Markus Kovero -Original Message- From: Henrik Johansson [mailto:henr...@henkis.net] Sent: 24. heinäkuuta 2009 12:26 To: Markus Kovero Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] No files but pool is full? On 24 jul 2009, at 09.33, Markus Kovero markus.kov...@nebula.fi wrote: During our tests we noticed very disturbing behavior, what would be causing this? System is running latest stable opensolaris. Any other means to remove ghost files rather than destroying pool and restoring from backups? This looks like bug i filed a while ago, CR 6792701 removing large holey files does bot free space. The only solution I found to clean the pool when isolating the bug was to recreate it. The fix was integrated inbuild post OSOL 2009.06. Mkfile of a certain size will trigger this. Henrik http://sparcv9.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
I would be interested in how to roll back to certain txg points in case of disaster; that was what Russel was after anyway. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Miles Nordin Sent: 19. heinäkuuta 2009 11:24 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work bj == Brent Jones br...@servuhome.net writes: bj many levels of fail here, pft. Virtualbox isn't unstable in any of my experience. It doesn't by default pass cache flushes from guest to host unless you set VBoxManage setextradata VMNAME VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush 0 however OP does not mention the _host_ crashing, so this questionable ``optimization'' should not matter. Yanking the guest's virtual cord is something ZFS is supposed to tolerate: remember the ``crash-consistent backup'' concept (not to mention the ``always consistent on disk'' claim, but really any filesystem even without that claim should tolerate having the guest's virtual cord yanked, or the guest's kernel crashing, without losing all its contents---the claim only means no time-consuming fsck after reboot). bj to blame ZFS seems misplaced, -1 The fact that it's a known problem doesn't make it not a problem. bj the subject on this thread especially inflammatory. so what? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
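For reference, builds after the one discussed here grew a limited answer to this. A hedged sketch, assuming a build with the pool-recovery support (roughly snv_128 and later) and a pool named tank as a placeholder; it rewinds to a recent consistent txg rather than an arbitrary one:

# dry run: report what a rewind would discard, without importing
zpool import -nF tank

# actually discard the last few transactions and import the pool
zpool import -F tank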