> I'm getting sub-optimal performance with an mmap based database
> (mongodb) which is running on zfs of Solaris 10u9.
> System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 *
> ram (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks
> - a few mongodb instances are running with moderate IO and total
> rss of 50 GB
> - a service which logs quite excessively (5GB every 20 mins) is also
> running (max 2GB ram use) - log files are compressed with bzip2 after
> some time.
> Database performance is quite horrid though - it seems that zfs does not
> know how to manage allocation between page cache and arc cache - and it
> seems arc cache wins most of the time.
Or to be more accurate, there is no coordination that I am aware of between the
VM page cache and the ARC. Which, for all the glories of ZFS, strikes me as a
*doh*face-in-palm* how-did-we-miss-this sorta thing. One of these days I need
to ask Jeff and Bill what they were thinking.
We went through this 9 months ago - we wrote MongoDB, which attempted to mmap()
whole database files for the purpose of skimming back and forth through them
quickly (think column-oriented database). Performance, um, sucked.
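For anyone who hasn't played with this pattern, here is a rough sketch of the map-the-whole-file-and-skim approach. It is a toy in Python, not our code: the one-int64-per-row layout and the names write_column, read_rows and demo are made up for illustration. The point is that every row access is just a memory read, with the kernel paging data in behind your back. On ZFS those pages sit in the VM page cache while the same data is also held in the ARC.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: one little-endian int64 per row, packed back to back.
RECORD = struct.Struct("<q")


def write_column(path, values):
    """Write a toy column file of fixed-width int64 values."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_rows(path, indices):
    """Map the whole file and pick out rows by index, in any order."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Each access is a plain memory read; pages are faulted in on demand.
            return [RECORD.unpack_from(m, i * RECORD.size)[0] for i in indices]


def demo():
    """Build a 1000-row column, then skim it out of order."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_column(path, range(1000))
        return read_rows(path, [999, 0, 500])
    finally:
        os.remove(path)
```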
There is a practical limit to the amount of RAM you can shove into a machine -
and said RAM gets slower as you have to go to quad-rank DIMMs, which Nehalem
can't run at full speed - for the sort of box you speak of, your top end of
1333MHz is 96G, last I checked. (We're at 192G in most cases.) So while yes
copying the data around between VM and ARC is doable, in large quantities that
are invariably going to blow the CPU L3, this may not be the most practical
It didn't help of course that
a) said DB was implemented in Java - _please_ don't ask - which is hardly a
poster child for implementing any form of mmap(), not to mention spins a ton of
b) said machine _started_ with 72 2TB Constellations and a pack of Cheetahs
arranged in 7 pools, resulting in ~700 additional kernel threads roaming
around, all of which got woken up on any heavy disk access (yes they could have
all been in one pool - and yes there is a specific reason for not doing so)
but and still.
We managed to break ZFS as a result. There are a couple of cases filed. One is
semi-patched, the other we're told simply can't be fixed in Solaris 10.
Fortunately we understand the conditions that create the breakage, and work
around it by Just Not Doing That(tm). In your configuration, I can almost
guarantee you will not run into them.
> I'm thinking of doing the following:
> - relocating mmapped (mongo) data to a zfs filesystem with only
> metadata cache
> - reducing zfs arc cache to 16 GB
> Are there any other recommendations - and is above likely to improve
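For concreteness, those two changes would look roughly like what the sketch below prints. The zfs_arc_max tunable in /etc/system and the primarycache dataset property are the usual knobs on Solaris 10; the dataset name tank/mongo is a placeholder, I'm assuming "16 GB" means 16 GiB, and the helper names are made up. The /etc/system change only takes effect after a reboot.

```python
def arc_max_setting(gib):
    """Return the /etc/system line capping the ARC at `gib` GiB."""
    return "set zfs:zfs_arc_max = 0x%x" % (gib * 2**30)


def metadata_only_cache(dataset):
    """Return the command that limits ARC caching to metadata for a dataset."""
    return "zfs set primarycache=metadata %s" % dataset


if __name__ == "__main__":
    # "tank/mongo" is a hypothetical dataset name; substitute your own.
    print(arc_max_setting(16))
    print(metadata_only_cache("tank/mongo"))
```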
Well... we ended up
(a) rewriting MongoDB to use in-process "buffer workspaces" and read()/write()
to fill/dump the buffers to disk (essentially, giving up on mmap())
(b) moving most of the workload to CentOS and using the Solaris boxes as big
fast NFSv3 fileservers (NFSv4 didn't work out so well for us) over 10G, because
for most workloads it runs 5-8% faster on CentOS than Solaris, and we're
primarily a CentOS shop anyway so it was just easier for everyone to deal with
- but this has little to do with mmap() difficulties
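To give a feel for what (a) means, here is a minimal sketch of a buffer workspace: a fixed-size buffer owned by the process, filled and dumped with explicit reads and writes instead of a mapping. It is a Python toy under my own assumptions, not the actual rewrite; the class name BufferWorkspace and its methods are invented for illustration.

```python
import os
import tempfile


class BufferWorkspace:
    """A fixed-size in-process buffer filled and dumped with explicit I/O."""

    def __init__(self, path, size):
        self.f = open(path, "r+b")
        self.buf = bytearray(size)
        self.offset = 0  # file offset the buffer was last filled from
        self.valid = 0   # number of meaningful bytes currently in buf

    def fill(self, offset):
        """Read a window of the file starting at `offset` into the buffer."""
        self.f.seek(offset)
        self.valid = self.f.readinto(self.buf)
        self.offset = offset
        return self.valid

    def dump(self):
        """Write the buffer contents back to where they were read from."""
        self.f.seek(self.offset)
        self.f.write(memoryview(self.buf)[: self.valid])
        self.f.flush()

    def close(self):
        self.f.close()


def demo():
    """Fill a 4-byte window, modify it in memory, dump it back."""
    fd, path = tempfile.mkstemp()
    os.write(fd, b"0123456789abcdef")
    os.close(fd)
    try:
        ws = BufferWorkspace(path, 4)
        ws.fill(4)            # buffer now holds b"4567"
        ws.buf[0:4] = b"WXYZ"  # work on the data in process memory
        ws.dump()
        ws.close()
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

The trade is that you manage the window yourself, but the data lives in one place you control rather than being cached once by the VM and again by the ARC.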
Given what I know of the Solaris VM, VFS and of ZFS as implemented - admittedly
incomplete, and my VM knowledge is based mostly on SVR4 - it would seem to me
that it is going to take some Really Creative Thinking to work around the
mmap() problem - a tweak or two ain't gonna cut it.
zfs-discuss mailing list