2012-06-14 19:11, tpc...@mklab.ph.rhul.ac.uk wrote:

In message <201206141413.q5eedvzq017...@mklab.ph.rhul.ac.uk>, tpc...@mklab.ph.rhul.ac.uk writes:
Memory: 2048M phys mem, 32M free mem, 16G total swap, 16G free swap
My WAG is that your "zpool history" is hanging due to lack of
RAM.

Interesting.  In the problem state the system is usually quite
responsive, i.e. not thrashing memory.  Under Linux, which I'm more
familiar with, 'used memory' = 'total memory' - 'free memory' refers
to physical memory used by the kernel for data caching (which is
still available for processes to allocate as needed) together with
memory already allocated to processes, as opposed to only physical
memory already allocated and therefore really 'used'.  Does this
mean something different under Solaris ?

Well, it is roughly similar. Solaris has a general notion of
"swap", also called "virtual memory" (so as not to confuse adepts
of other systems), which is the combination of RAM and on-disk
swap space. Tools imported from other environments, like the "top"
above, use the common notions of "physical memory" and "on-disk
swap"; native tools like "vmstat" print a "swap" column (free
virtual memory) and a "free" column (free RAM)...
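To see the two accountings side by side, something like this should
work (exact column layout varies a bit between Solaris releases):

    vmstat 5 3   # "swap" = free virtual memory, "free" = free RAM, in KB
    swap -s      # summary of VM reserved, allocated and still available
    swap -l      # the on-disk swap devices backing part of that VM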

Processes are allocated their memory requirements from this generic
"swap = virtual memory", though some tricks are possible: some pages
may be marked as not "swappable" to disk, others may require a
reservation of on-disk swap space even while all the data still
lives in RAM. Kernel memory, for example that used by ZFS, is never
paged out to on-disk swap (which can cause system freezes due to a
shortage of RAM for operations, if some big ZFS task is not ready
to just release that virtual memory).
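You can see how much RAM the kernel (and ZFS in particular) is
holding with mdb's ::memstat dcmd; it needs root privileges and can
take a while on a big-memory box:

    echo ::memstat | mdb -k
    # prints a breakdown along the lines of Kernel / Anon /
    # Exec and libs / Page cache / Free; recent builds also break
    # out "ZFS File Data" as its own line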

The ZFS ARC cache may release its memory "on request" when other
processes need RAM, but this takes some time (and some programs
check for free memory, conclude they can't get more, and break
without even trying), so the OS usually keeps a reserve of free
memory. For free RAM to drop as low as your 32MB low watermark,
some heavy hammering must be going on...
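The ARC's current and maximum sizes are visible through kstat, and
if it routinely eats everything you can cap it in /etc/system (the
512MB value below is only an illustration - size it for your box,
and a reboot is needed for it to take effect):

    kstat -p zfs:0:arcstats:size    # current ARC size, bytes
    kstat -p zfs:0:arcstats:c       # current ARC target size
    kstat -p zfs:0:arcstats:c_max   # ceiling the ARC may grow to

    # in /etc/system, e.g. cap the ARC at 512MB (0x20000000 bytes):
    set zfs:zfs_arc_max = 0x20000000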

Now, back to the 2GB RAM problem: ZFS has lots of metadata.
Both reads and writes to the pool have to traverse a large tree
of block pointers, with the leaves of the tree containing pieces
of your user-data. Updates to user-data cause a rewrite of the
whole path through the tree, from the updated blocks up to the
root (each metadata block must be read, modified, and
re-checksummed in its parent; recurse to the root).

Metadata blocks are also stored on disk, but in several copies
per block (double or triple the IOPS cost).
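If you are curious how much of your pool is actually metadata, zdb
can walk the blockpointer tree and print per-type block statistics.
Fair warning: this traverses the whole pool, so it is slow and best
run while the system is otherwise quiet ("mypool" is a placeholder):

    zdb -bb mypool   # block counts and sizes, broken down by type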

ZFS works fast when the "hot" paths through the needed portions
of the blockpointer tree, or, even better, the whole tree, are
cached in RAM. Otherwise, the least-used blocks are evicted to
accommodate the recent newcomers. If you are low on RAM and useful
blocks get evicted, they must be re-read from disk to get them
back (evicting some others in turn), which may cause the lags
you're seeing. The high share of kernel time also indicates that
it is not some userspace computation hogging the CPUs, but rather
waiting on hardware I/O.
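Whether the ARC is thrashing shows up in its hit/miss counters;
sampling them a few seconds apart gives you a rough hit rate (the
arcstat.pl script, if you have it around, automates exactly this):

    kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses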

Running "iostat 1" or "zpool iostat 1" can help you spot patterns
(at least, whether there are many disk reads while the system is
"hung"). Perhaps the pool is being scrubbed, or the slocate
database is being updated, or several machines begin dumping their
backups onto the fileserver at once - and with so little cache the
machine nearly dies, in terms of performance and responsiveness
at least.
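For example ("mypool" is a placeholder for your pool name):

    zpool iostat -v mypool 1   # per-vdev read/write ops and bandwidth
    iostat -xn 1               # per-device service times and %busy
    zpool status mypool        # also shows a scrub/resilver in progress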

This lack of RAM is especially deadly upon writes into deduped
pools, because DDT tables tend to be large (tens of GB for
moderately-sized pools of tens of TB). Your box seems to have a
12TB pool with just a little bit used, yet the shortage of RAM is
already plain to see...
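You can check how big the DDT really is; "zpool status -D" prints
the entry count plus the per-entry on-disk and in-core sizes, from
which the total RAM footprint follows (again, "mypool" is a
placeholder):

    zpool status -D mypool   # look for "DDT entries N, size ... in core"
    zdb -DD mypool           # more detailed dedup-table histograms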

Hope this helps (understanding at least),
//Jim Klimov
