Re: [zfs-discuss] Reading ZFS config for an extended period
I have a 1TB mirrored, deduplicated pool. snv_134 running on an x86 i7 PC with 8GB RAM. I destroyed a 30GB zfs volume and am now trying to import that pool from a LiveUSB-booted OpenSolaris. It has been running for 2 hours already, and I'm still waiting... How can I see a progress bar or some other sign of the current import job's progress?
-- This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Reading ZFS config for an extended period
- Original Message -
> I have a 1TB mirrored, deduplicated pool. snv_134 running on an x86 i7 PC with 8GB RAM. I destroyed a 30GB zfs volume and am now trying to import that pool from a LiveUSB-booted OpenSolaris. It has been running for 2 hours already, and I'm still waiting...

It may even take longer; I've seen this take a while. It's a known bug. The fix is not to use dedup...

> How can I see a progress bar or some other sign of the current import job's progress?

You can't. If you reboot, the system will likely hang until the volume is removed. It should be possible to bring the system up in single-user mode, but you should probably just wait.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] Reading ZFS config for an extended period
It finished! Going to switch off dedup... if that's still possible.
Re: [zfs-discuss] Reading ZFS config for an extended period
Ugh! If you received a direct response from me instead of via the list, apologies for that.

Rob: I'm just reporting the news. The RFE is out there. Just like SLOGs, I happen to think it a good idea, personally, but that's my personal opinion. If it makes dedup more usable, I don't see the harm.

Taemun: The issue, as I understand it, is not "use lots of CPU" or "just dies from paging". I believe it has more to do with all of the small, random reads/writes involved in updating the DDT. Remember, the DDT is stored within the pool, just as the ZIL is if you don't have a SLOG. (The S in SLOG stands for "separate".) So all the DDT updates are in competition for I/O with the actual data deletion. If the DDT could already be stored as a separate VDEV, I'm sure a way would have been hacked together by someone (likely someone on this list). Hence the need for the RFE to create this functionality where it does not currently exist. The DDT is separate from the ARC or L2ARC. Here's the bug: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913566 If I'm incorrect, someone please let me know.

Markus: Yes, the issue would appear to be dataset size vs. RAM size. Sounds like an area ripe for testing, much like RAIDZ3 performance.

Cheers all!

On Tue, Feb 16, 2010 at 00:20, taemun tae...@gmail.com wrote:
> The system in question has 8GB of RAM. It never paged during the import (unless I was asleep at that point, but anyway). It ran for 52 hours, then started showing 47% kernel CPU usage. At this stage, dtrace stopped responding, and so iopattern died, as did iostat. It was also increasing RAM usage rapidly (15MB/minute). After an hour of that, the CPU went up to 76%. An hour later, CPU usage stopped. Hard drives were churning throughout all of this (albeit at a rate that suggests each vdev is being controlled by a single-threaded operation). I'm guessing that if you don't have enough RAM, it gets stuck on the use-lots-of-CPU phase, and just dies from too much paging.
>
> Of course, I have absolutely nothing to back that up. Personally, I think that if L2ARC devices were persistent, we would already have the mechanism in place for storing the DDT as a separate vdev. The problem is, there is nothing you can run at boot time to populate the L2ARC, so dedup writes are ridiculously slow until the cache is warm. If the cache stayed warm, or there was an option to forcibly warm up the cache, this could be somewhat alleviated.
>
> Cheers

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
Re: [zfs-discuss] Reading ZFS config for an extended period
k == Khyron khyron4...@gmail.com writes:

k The RFE is out there. Just like SLOGs, I happen to think it a
k good idea, personally, but that's my personal opinion. If it
k makes dedup more usable, I don't see the harm.

slogs and l2arcs, modulo the current longstanding ``cannot import pool with attached missing slog'' bug, are disposable: you will lose either a little data or no data if the device goes away (once the bug is finally fixed). This makes them less ponderous, because these days we are looking for raidz2 or raidz3 amounts of redundancy, so in a separate device that wasn't disposable we'd need a 3- or 4-way mirror. It also makes their separateness more separable, since they can go away at any time, so maybe they do deserve to be separate. The two together make the complexity more bearable.

Would an sddt be disposable, or would it be a critical top-level vdev needed for import? If it's critical, well, that's kind of annoying, because now we need 3-way mirrors of sddt to match the minimum best-practice redundancy of the rest of the pool, and my first reaction would be ``can you spread it throughout the normal raidz{,2,3} vdevs at least in backup form?'' Once I say a copy should be kept in the main pool even after it becomes an sddt, well, what does that imply?

* In the read case it means caching, so it could go in the l2arc. How's the DDT different from anything else in the l2arc?

* In the write case it means sometimes committing it quickly without waiting on the main pool, so we can release some lock or answer some RPC and continue. Why not write it to the slog? Then if we lose the slog we can do what we always do without the slog and roll back to the last valid txg, losing whatever writes were associated with that lost DDT update.

The two cases fit fine with the types of SSDs we're using for each role and the type of special error recovery we have if we lose the device.

Why litter a landscape so full of special cases and tricks (like the ``import pool with missing slog'' bug that is taking so long to resolve) with yet another kind of vdev that will take a year to discover its special cases and half a decade to address them? Maybe there is a reason. Are DDT write patterns different from slog write patterns? Is it possible to make a DDT read cache using less ARC for pointers than the l2arc currently uses? Is the DDT particularly hard on the l2arc by having small block sizes? Will the sddt be delivered with a separate offline ``not an fsck!!!'' tool for slowly regenerating it from pool data if it's lost? Or maybe, after an sddt goes bad, the pool could be mounted space-wastingly (as in: no dedup is done and deletes do not free space) with an empty DDT, and the sddt regenerated by a scrub? If the performance or recovery behavior is different from what we're working towards with optional-slog and persistent-l2arc, then maybe sddt does deserve to be another vdev type.

So, I dunno. On one hand I'm clearly nowhere near informed enough to weigh in on an architectural decision like this and shouldn't even be discussing it, and the same applies to you, Khyron, in my view, since our input seems obvious at best and misinformed at worst. On the other hand, another major architectural change (slog) was delivered incomplete in a cripplingly bad and silly, trivial way for, AFAICT, nothing but lack of sufficient sysadmin bitching and moaning, leaving heaps of multi-terabyte naked pools out there for half a decade with fancy triple redundancy that will be totally lost if a single SSD + zpool.cache goes bad. So apparently thinking things through even at this trivial level might have some value to the ones actually doing the work.
Re: [zfs-discuss] Reading ZFS config for an extended period
Just thought I'd chime in for anyone who had read this - the import operation completed this time, after 60 hours of disk grinding. :)
Re: [zfs-discuss] Reading ZFS config for an extended period
The DDT is stored within the pool, IIRC, but there is an RFE open to allow you to store it on a separate top-level VDEV, like a SLOG.

The other thing I've noticed with all of the "destroyed a large dataset with dedup enabled and it's taking forever to import/destroy/<insert function here>" questions is that the process runs so, so, so much faster with 8+ GiB of RAM. Almost to a man, everyone who reports these 3-, 4-, or more-day destroys has less than 8 GiB of RAM on the storage server.

Just some observations/thoughts.

On Mon, Feb 15, 2010 at 23:14, taemun tae...@gmail.com wrote:
> Just thought I'd chime in for anyone who had read this - the import operation completed this time, after 60 hours of disk grinding. :)

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
Re: [zfs-discuss] Reading ZFS config for an extended period
> RFE open to allow you to store [DDT] on a separate top level VDEV

Hmm, add to this the spare, log, and cache vdevs, and it's to the point of making another pool and thinly provisioning volumes to maintain partitioning flexibility.

taemun: hey, thanks for closing the loop!

Rob
Re: [zfs-discuss] Reading ZFS config for an extended period
The system in question has 8GB of RAM. It never paged during the import (unless I was asleep at that point, but anyway). It ran for 52 hours, then started showing 47% kernel CPU usage. At this stage, dtrace stopped responding, and so iopattern died, as did iostat. It was also increasing RAM usage rapidly (15MB/minute). After an hour of that, the CPU went up to 76%. An hour later, CPU usage stopped. Hard drives were churning throughout all of this (albeit at a rate that suggests each vdev is being controlled by a single-threaded operation). I'm guessing that if you don't have enough RAM, it gets stuck on the use-lots-of-CPU phase, and just dies from too much paging.

Of course, I have absolutely nothing to back that up. Personally, I think that if L2ARC devices were persistent, we would already have the mechanism in place for storing the DDT as a separate vdev. The problem is, there is nothing you can run at boot time to populate the L2ARC, so dedup writes are ridiculously slow until the cache is warm. If the cache stayed warm, or there was an option to forcibly warm up the cache, this could be somewhat alleviated.

Cheers
Re: [zfs-discuss] Reading ZFS config for an extended period
> The other thing I've noticed with all of the "destroyed a large dataset with dedup enabled and it's taking forever to import/destroy/<insert function here>" questions is that the process runs so, so, so much faster with 8+ GiB of RAM. Almost to a man, everyone who reports these 3-, 4-, or more-day destroys has less than 8 GiB of RAM on the storage server.

I've witnessed destroys that take several days on 24GB+ systems (datasets over 30TB). I guess it's just a matter of how large the dataset is vs. how much RAM.

Yours
Markus Kovero
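The "dataset size vs. RAM" intuition can be sanity-checked with taemun's own DDT statistics (entry counts and per-entry in-core sizes, as reported from zdb-style output elsewhere in this thread). A back-of-the-envelope sketch; treating the in-core footprint as simply entries times per-entry in-core size is my simplification, not an established formula:

```python
# Rough in-core DDT footprint from zdb-style stats.
# Figures are taemun's, as reported in this thread.
ddt_tables = [
    (400_478, 295),     # DDT-sha256-zap-duplicate: entries, bytes in core
    (10_965_661, 187),  # DDT-sha256-zap-unique:    entries, bytes in core
]

total_bytes = sum(entries * size for entries, size in ddt_tables)
print(f"approx. DDT in core: {total_bytes / 2**30:.2f} GiB")
# → approx. DDT in core: 2.02 GiB
```

Roughly 2 GiB of DDT that wants to stay resident, on top of everything else the ARC holds, is at least consistent with machines under 8 GiB struggling while larger ones fare better.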
Re: [zfs-discuss] Reading ZFS config for an extended period
After around four days the process appeared to have stalled (no audible hard drive activity). I restarted with milestone=none, deleted /etc/zfs/zpool.cache, restarted, and ran "zpool import tank". (I also allowed root login over ssh, so I could open new ssh sessions if required.) Now I can watch the process from the machine itself.

My present question is: how is the DDT stored? I believe the DDT has around 10M entries for this dataset, as per:

DDT-sha256-zap-duplicate: 400478 entries, size 490 on disk, 295 in core
DDT-sha256-zap-unique: 10965661 entries, size 381 on disk, 187 in core

(taken just prior to the attempt to destroy the dataset)

A sample from iopattern shows:

%RAN %SEQ COUNT  MIN    MAX  AVG   KR
 100    0   195  512    512  512   97
 100    0   414  512  65536  895  362
 100    0   261  512    512  512  130
 100    0   273  512    512  512  136
 100    0   247  512    512  512  123
 100    0   297  512    512  512  148
 100    0   292  512    512  512  146
 100    0   250  512    512  512  125
 100    0   274  512    512  512  137
 100    0   302  512    512  512  151
 100    0   294  512    512  512  147
 100    0   308  512    512  512  154
  98    2   286  512    512  512  143
 100    0   270  512    512  512  135
 100    0   390  512    512  512  195
 100    0   269  512    512  512  134
 100    0   251  512    512  512  125
 100    0   254  512    512  512  127
 100    0   265  512    512  512  132
 100    0   283  512    512  512  141

As the pool is comprised of 2x 8-disk raidz vdevs, I presume that each element is stored twice (for the raidz redundancy). So at around 280 512-byte read ops/s, that's 140 entries per second.

Is the import of a semi-broken pool:
1. Reading all the DDT markers for the dataset; or
2. Reading all the DDT markers for the pool; or
3. Reading all of the block markers for the dataset; or
4. Reading all of the block markers for the pool
prior to actually finalising what it needs to do to fix the pool?

I'd like to be able to estimate how long the import will take before it finishes. Or should I tell it to roll back to the last valid txg - i.e. to before the "zfs destroy dataset" command was issued - via "zpool import -F"? Or is that likely to take as long as, or longer than, the present import/fix?

Cheers.
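Taemun's arithmetic (around 280 random 512-byte reads/s across the two raidz vdevs, i.e. roughly 140 DDT entries/s) lends itself to a rough lower-bound estimate. A sketch; the assumptions that the import visits every DDT entry exactly once and that the observed rate holds throughout are mine, and not confirmed anywhere in this thread:

```python
# Rough ETA for one full pass over the DDT at the observed read rate.
# ASSUMPTION: one 512-byte read per entry per raidz copy, every entry
# visited exactly once - neither is documented import behaviour.
dup_entries = 400_478        # DDT-sha256-zap-duplicate (from zdb output)
unique_entries = 10_965_661  # DDT-sha256-zap-unique (from zdb output)
reads_per_sec = 280          # observed via iopattern
copies = 2                   # 2x raidz vdevs -> each entry read twice

entries_per_sec = reads_per_sec / copies
seconds = (dup_entries + unique_entries) / entries_per_sec
print(f"approx. {seconds / 3600:.0f} hours")
# → approx. 23 hours
```

Since the import ultimately took about 60 hours, if this model is even directionally right, the import is doing considerably more than a single pass over the DDT, or sustaining a lower rate for much of the run.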
Re: [zfs-discuss] Reading ZFS config for an extended period
On 02/11/10 08:15, taemun wrote:
> Can anyone comment about whether the on-boot "Reading ZFS config" is any slower/better/whatever than deleting zpool.cache, rebooting, and manually importing? I've been waiting more than 30 hours for this system to come up. There is a pool with 13TB of data attached. The system locked up whilst destroying a 934GB dedup'd dataset, and I was forced to reboot it. I can hear hard drive activity presently - i.e. it's doing *something* - but am really hoping there is a better way :) Thanks

I think that this is a consequence of 6924390 - http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6924390 "ZFS destroy on de-duped dataset locks all I/O". This bug is closed as a dup of another bug which is not readable from the opensolaris site (I'm not clear what makes some bugs readable and some not).

While trying to reproduce 6924390 (or its equivalent) yesterday, my system hung as yours did, and when I rebooted, it hung at "Reading ZFS config". Someone who knows more about the root cause of this situation (i.e., the bug named above) might be able to tell you what's going on and how to recover. (It might be that the destroy has resumed and you have to wait for it to complete, which I think it will, but it might take a long time.)

Lori
Re: [zfs-discuss] Reading ZFS config for an extended period
On 02/11/10 10:33, Lori Alt wrote:
> This bug is closed as a dup of another bug which is not readable from the opensolaris site (I'm not clear what makes some bugs readable and some not).

The other bug in question was opened yesterday and probably hasn't had time to propagate.

- Bill
Re: [zfs-discuss] Reading ZFS config for an extended period
Do you think that more RAM would help this progress faster? We've just hit 48 hours. No visible progress (although that doesn't really mean much). The pool is presently in a system with 8GB of RAM; I could try moving it across to a system with 20GB of RAM, if that is likely to expedite the process. Of course, if it isn't going to make any difference, I'd rather not restart this process.

Thanks

On 12 February 2010 06:08, Bill Sommerfeld sommerf...@sun.com wrote:
> On 02/11/10 10:33, Lori Alt wrote:
>> This bug is closed as a dup of another bug which is not readable from the opensolaris site (I'm not clear what makes some bugs readable and some not).
>
> The other bug in question was opened yesterday and probably hasn't had time to propagate.
>
> - Bill