An additional data point: when I run zdb -e -d and look for the
incomplete zfs recv snapshot, I get the following error:

# sudo zdb -e -d xxx-yy-01 | grep "%"
Could not open xxx-yy-01/aaa-bb-01/aaa-bb-01-01/%1309906801, error 16
#

Does anyone know what error 16 from zdb means, and how it might
impact importing this zpool?
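If zdb is just passing through a raw errno here, the number can be
decoded on the box itself. A quick sketch (assumes a python interpreter
is installed; a perl one-liner like perl -e '$! = 16; print "$!\n"'
does the same thing):

```shell
# Decode errno 16 to its symbolic name and message. Nothing here is
# Solaris-specific; 16 is EBUSY in both the Solaris and Linux tables.
python3 -c 'import errno, os; print(errno.errorcode[16], "-", os.strerror(16))'
```

For reference, EBUSY would suggest the dataset was busy or held when
zdb tried to open it, though that is my reading, not something zdb
states.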

On Wed, Aug 3, 2011 at 9:19 AM, Paul Kraus <p...@kraus-haus.org> wrote:
>    I am having a very odd problem, and so far the folks at Oracle
> Support have not provided a working solution, so I am asking the crowd
> here while still pursuing it via Oracle Support.
>
>    The system is a T2000 running 10U9 with CPU-2010-01 and two J4400
> arrays loaded with 1 TB SATA drives. There is one zpool on the J4400s
> (3 x 15-disk vdevs + 3 hot spares). This system is the target for
> zfs send / recv replication from our production server. The OS is UFS
> on local disk.
>
>     While I was on vacation this T2000 hung with "out of resource"
> errors. Other staff tried rebooting, which hung the box. Then they
> rebooted off of an old BE (10U9 without CPU-2010-01). Oracle Support
> had them apply a couple patches and an IDR to address zfs "stability
> and reliability problems" as well as set the following in /etc/system
>
> set zfs:zfs_arc_max = 0x700000000 (which is 28 GB)
> set zfs:arc_meta_limit = 0x700000000 (which is 28 GB)
>
>    The system has 32 GB RAM and 32 (virtual) CPUs. They then tried
> importing the zpool and the system hung (after many hours) with the
> same "out of resource" error. At this point they left the problem for
> me :-(
>
>    I removed the zfs.cache from the 10U9 + CPU 2010-10 BE and booted
> from that. I then applied the IDR (IDR146118-12) and the zfs patch it
> depended on (145788-03). I did not include the zfs arc and zfs arc
> meta limits as I did not think them relevant. A zpool import shows the
> pool is OK and a sampling with zdb -l of the drives shows good labels.
> I started importing the zpool and after many hours it hung the system
> with "out of resource" errors. I had a number of tools running to see
> what was going on. The only thing this system is doing is importing
> the zpool.
>
> ARC had climbed to about 8 GB and then declined to 3 GB by the time
> the system hung. This tells me that something else is consuming RAM
> and the ARC is releasing memory in response.
>
> The hung TOP screen showed the largest user process had only 148 MB
> allocated (and much less resident).
>
> VMSTAT showed a scan rate of over 900,000 (NOT a typo) and almost 8 GB
> of free swap (so whatever is using memory cannot be paged out).
>
>    So my guess is that a kernel module is consuming all (and more)
> of the RAM in the box. I am looking for a way to query how much RAM
> each kernel module is using, and to script that in a loop (which
> will hang when the box next runs out of RAM). I am very open to
> suggestions here.
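One way to script that sampling, a sketch assuming Solaris mdb and
kstat are available and run as root (the log path, sample count, and
interval below are arbitrary placeholders):

```shell
#!/bin/sh
# Sample kernel memory usage in a loop so the last log entries before a
# hang show what was growing. ::memstat summarizes physical memory by
# consumer; ::kmastat breaks kernel allocations down per kmem cache.
# On a non-Solaris box the mdb/kstat lines just log an error.
LOG=${TMPDIR:-/tmp}/kmem-usage.log
SAMPLES=3        # use a large count (or while :; do) on the real box
INTERVAL=1       # seconds between samples; 60 is more realistic
: > "$LOG"
n=0
while [ "$n" -lt "$SAMPLES" ]; do
    {
        echo "==== $(date) ===="
        echo ::memstat | mdb -k        # memory by consumer (kernel/anon/free)
        echo ::kmastat | mdb -k        # per-kmem-cache allocation detail
        kstat -p zfs:0:arcstats:size   # current ARC size, for correlation
    } >> "$LOG" 2>&1
    sleep "$INTERVAL"
    n=$((n + 1))
done
```

Comparing the last ::kmastat sample against an early one should show
which caches (and therefore, roughly, which module) grew as the import
ran.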
>
>   Since this is the recv end of the replication, I assume a zfs recv
> was running when the system initially hung. I know a 3+ TB snapshot
> was replicating (via a 100 Mbps WAN link) when I left for vacation,
> and it may still have been running. I also assume that any partial
> snapshots (% instead of @) are removed when the pool is imported. But
> what could cause the removal of a partial snapshot, even a very large
> one, to run the system out of RAM? What caused the initial hang of
> the system (I assume it was also out of RAM)? I did not think there
> was a limit on the size of either a snapshot or a zfs recv.
>
> Hung TOP screen:
>
> load averages: 91.43, 33.48, 18.989             xxx-xxx1               
> 18:45:34
> 84 processes:  69 sleeping, 12 running, 1 zombie, 2 on cpu
> CPU states: 95.2% idle,  0.5% user,  4.4% kernel,  0.0% iowait,  0.0% swap
> Memory: 31.9G real, 199M free, 267M swap in use, 7.7G swap free
>
>   PID USERNAME THR PR NCE  SIZE   RES STATE   TIME FLTS    CPU COMMAND
>   533 root      51 59   0  148M 30.6M run   520:21    0  9.77% java
>  1210 yyyyyy     1  0   0 5248K 1048K cpu25   2:08    0  2.23% xload
>  14720 yyyyyy     1 59   0 3248K 1256K cpu24   1:56    0  0.03% top
>   154 root       1 59   0 4024K 1328K sleep   1:17    0  0.02% vmstat
>  1268 yyyyyy     1 59   0 4248K 1568K sleep   1:26    0  0.01% iostat
> ...
>
> VMSTAT:
>
> kthr      memory            page            disk          faults      cpu
>  r b w   swap  free  re  mf pi po fr de sr m0 m1 m2 m3   in   sy   cs us sy id
>  0 0 112 8117096 211888 55 46 0 0 425 0 912684 0 0 0 0  976  166  836  0  2 98
>  0 0 112 8117096 211936 53 51 6 0 394 0 926702 0 0 0 0  976  167  833  0  2 98
>
> ARC size (B): 4065882656
>
>



-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Designer: Frankenstein, A New Musical
(http://www.facebook.com/event.php?eid=123170297765140)
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
