2011-06-10 13:51, Jim Klimov wrote:
and the system dies in
swapping hell (scanrates for available pages were seen to go
into the millions, CPU context switches reach 200-300k/sec on a
single dual-core P4) after eating the last stable-free 1-2GB
of RAM within a minute. After this the system responds to
nothing except the reset button.


I've captured an illustration of this today, with my watchdog as
well as vmstat, top and other tools. Half a gigabyte of RAM gone
in under one second - the watchdog never saw it coming :(

My "freeram-watchdog" is based on vmstat but emphasizes
deltas in "freeswap" and "freeram" values (see middle columns)
and has less fields with more readable names ;)

freq freeswap freeram scanrate Dswap Dram in sy cs us sy id
1 6652236 497088 0 0 -5380 3428 3645 3645 0 83 17
1 6652236 502112 0 0 5024 2332 2962 2962 1 72 27
1 6652236 494656 0 0 -7456 2886 3641 3641 0 78 21
1 6652236 502024 0 0 7368 3748 4197 4197 1 83 16
1 6652236 502316 0 0 292 4090 2516 2516 0 68 32
1 6652236 498388 0 0 -3928 2270 3940 3940 1 76 24
1 6652236 502264 0 0 3876 3097 3097 3097 0 76 23
1 6652236 495052 0 0 -7212 2705 2796 2796 1 86 14
1 6652236 502384 0 0 7332 3609 4449 4449 1 81 18
1 6652236 502292 0 0 -92 3639 2639 2639 1 80 19
1 6652236 92064 3435680 0 -410228 15665 1312 1312 0 99 0

In VMSTAT it looked like this:
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s2 s3 s4 in sy cs us sy id
0 0 0 6652236 495052 0 32 0 0 0 0 0 0 0 0 0 3037 4107 158598 0 90 10
0 0 0 6652236 502384 0 24 0 0 0 0 0 0 0 0 0 3266 2697 114195 1 78 21
6 0 0 6652236 502292 0 23 0 0 0 0 0 0 15 15 15 2947 3048 130070 0 87 13
29 0 0 6652236 92064 124 155 0 0 5084 0 3706374 0 0 0 0 16743 1244 2696 0 100 0

So, for a couple of minutes before the freeze, the system was rather
stable at around 500MB of free RAM. Before that it had been stable
at 1-1.2GB, but then dropped to 500MB in about 10 seconds.

And in the last second of known uptime, the system ate up at least
400MB and began scanning for free pages at 3.7 million scans/sec.
Usually it takes about 3-5 seconds to reach this condition, I see
"cs" climb to about 200k, and my watchdog has time to reboot the
system. Not this time :(

According to TOP, free RAM dropped down to 32MB (which in
my older adventures was also the empirical lower limit of RAM at
which the system began scanrating itself to death), with the zpool
process ranking high - though until now I had not seen pageout make
it into top during the system's last second of life:

last pid: 1786; load avg: 3.59, 2.10, 1.09; up 0+02:20:28 15:07:20
118 processes: 100 sleeping, 16 running, 2 on cpu
CPU states: 0.0% idle, 0.4% user, 99.6% kernel, 0.0% iowait, 0.0% swap
Kernel: 2807 ctxsw, 210 trap, 16730 intr, 1388 syscall, 161 flt
Memory: 8191M phys mem, 32M free mem, 6655M total swap, 6655M free swap

PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1464 root 138 99 -20 0K 0K sleep 2:01 30.93% zpool-dcpool
2 root 2 97 -20 0K 0K cpu/1 0:03 26.67% pageout
1220 root 1 59 0 4400K 2188K sleep 1:04 0.57% prstat
1477 root 1 59 0 2588K 1756K run 3:13 0.28% freeram-watchdo
3 root 1 60 -20 0K 0K sleep 0:17 0.21% fsflush
522 root 1 59 0 4172K 1000K run 0:35 0.20% top


One way or another, such repeatable failure behaviour is simply
not acceptable for a production storage platform :( and I hope
to see it fixed - if I can help somehow, please let me know.

I *think* one way to reproduce it would be:
1) Enable dedup (optional?)
2) Write lots of data to disk, e.g. 2-3TB
3) Delete lots of data, or make and destroy a snapshot,
or destroy a dataset with test data

This puts the system into a position where it has to churn through
a large backlog of (not-yet-)deferred deletes; a rough command
sketch of the steps above follows below.
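
Roughly, as commands - the pool/dataset names and sizes here are
only examples and not from my actual setup, so adjust to your test
hardware; this is a sketch, not a tested recipe:

# 1) create a test pool and a dataset with dedup enabled
zpool create testpool c0t1d0
zfs create -o dedup=on testpool/junk

# 2) fill it with ~2TB of data (random data bloats the DDT nicely,
#    since every block is unique)
dd if=/dev/urandom of=/testpool/junk/bigfile bs=1024k count=2000000

# 3) delete lots of data at once, e.g. destroy the whole dataset,
#    and hard-reset the box while the deferred deletes are being
#    processed
zfs destroy testpool/junk

# after the reset, try to import the pool again and watch the box
zpool import testpool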

In my case this by itself often leads to RAM starvation, a hang,
and a subsequent reset of the box; on a TEST system you can simply
reset it during such delete processing.

Now when you reboot and try to import this test pool, you
should get a situation like mine - the pool does not import
quickly, zfs-related commands hang, and in a few hours
the box should die ;)

iostat reports many small reads and occasional writes (the writes
start about 10 minutes into the import), which gives me hope that
the pool will come back online sometime...
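
For reference, I just watch the disks from another terminal with
plain iostat, since the zfs/zpool commands themselves hang (the
10-second interval is an arbitrary choice):

iostat -xn 10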

The current version of my software watchdog, which saves my
assistant some trouble by catching near-freeze conditions,
is here:

* http://thumper.cos.ru/~jim/freeram-watchdog-20110610-v0.11.tgz
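
For those who don't want to grab the tarball, the core idea is
roughly this simplified sketch (not the actual script - the real
one tracks more fields and has tunables; the threshold and reboot
method here are just example values):

#!/bin/sh
# Simplified sketch of the freeram-watchdog idea: poll vmstat once a
# second, report the free-RAM delta, and force a reboot before the
# box scanrates itself to death.
MINFREE=262144          # KB (~256MB) - example threshold, tune to taste
prev=0
vmstat 1 | while read r b w swap free rest; do
    # skip vmstat's header lines (non-numeric "free" field)
    case "$free" in
        ''|*[!0-9]*) continue ;;
    esac
    delta=`expr $free - $prev`
    prev=$free
    echo "freeram=$free Dram=$delta"
    if [ "$free" -lt "$MINFREE" ]; then
        echo "free RAM below threshold, forcing reboot"
        uadmin 1 1      # A_REBOOT/AD_BOOT - immediate reboot
    fi
done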



--


+============================================================+
|                                                            |
| Климов Евгений,                                 Jim Klimov |
| технический директор                                   CTO |
| ЗАО "ЦОС и ВТ"                                  JSC COS&HT |
|                                                            |
| +7-903-7705859 (cellular)          mailto:jimkli...@cos.ru |
|                          CC:ad...@cos.ru,jimkli...@mail.ru |
+============================================================+
| ()  ascii ribbon campaign - against html mail              |
| /\                        - against microsoft attachments  |
+============================================================+



