Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-06-01 Thread Jeff Bacon
 I'm getting sub-optimal performance with an mmap-based database
 (mongodb) which is running on ZFS on Solaris 10u9.
 
 System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 *
 4GB)
 ram (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks
 
   - a few mongodb instances are running with moderate IO and a total
 rss of 50 GB
   - a service which logs quite excessively (5GB every 20 mins) is also
 running (max 2GB ram use) - log files are compressed after some
 time to bzip2.
 
 Database performance is quite horrid though - it seems that zfs does not
 know how to manage allocation between page cache and arc cache - and it
 seems arc cache wins most of the time.

Or to be more accurate, there is no coordination that I am aware of between the 
VM page cache and the ARC. Which, for all the glories of ZFS, strikes me as a 
*doh*face-in-palm* how-did-we-miss-this sorta thing. One of these days I need 
to ask Jeff and Bill what they were thinking. 

We went through this 9 months ago - we'd written our own database, which attempted 
to mmap() whole database files for the purpose of skimming back and forth through 
them quickly (think column-oriented database). Performance, um, sucked. 

There is a practical limit to the amount of RAM you can shove into a machine - 
and said RAM gets slower as you have to go to quad-rank DIMMs, which Nehalem 
can't run at full speed - for the sort of box you speak of, your top end at 
1333MHz is 96GB, last I checked. (We're at 192GB in most cases.) So while, yes, 
copying the data around between the VM page cache and the ARC is doable, in 
quantities large enough to invariably blow out the CPU's L3 cache it may not be 
the most practical answer.

It didn't help of course that 
a) said DB was implemented in Java - _please_ don't ask - which is hardly a 
poster child for implementing any form of mmap(), not to mention that it spins 
up a ton of threads 
b) said machine _started_ with 72 2TB Constellations and a pack of Cheetahs 
arranged in 7 pools, resulting in ~700 additional kernel threads roaming 
around, all of which got woken up on any heavy disk access (yes they could have 
all been in one pool - and yes there is a specific reason for not doing so)

but and still. 

We managed to break ZFS as a result. There are a couple of cases filed. One is 
semi-patched, the other we're told simply can't be fixed in Solaris 10. 
Fortunately we understand the conditions that create the breakage, and work 
around it by Just Not Doing That(tm). In your configuration, I can almost 
guarantee you will not run into them. 


 
 I'm thinking of doing the following:
   - relocating mmaped (mongo) data to a zfs filesystem with only
 metadata cache
   - reducing zfs arc cache to 16 GB
 
 Are there any other recommendations - and is the above likely to improve
 performance?

Well... we ended up 
(a) rewriting the database to use in-process buffer workspaces and read()/write() 
to fill/dump the buffers to disk (essentially, giving up on mmap())
(b) moving most of the workload to CentOS and using the Solaris boxes as big 
fast NFSv3 fileservers (NFSv4 didn't work out so well for us) over 10G, because 
for most workloads it runs 5-8% faster on CentOS than Solaris, and we're 
primarily a CentOS shop anyway so it was just easier for everyone to deal with 
- but this has little to do with mmap() difficulties 

Given what I know of the Solaris VM, VFS and of ZFS as implemented - admittedly 
incomplete, and my VM knowledge is based mostly on SVR4 - it would seem to me 
that it is going to take some Really Creative Thinking to work around the 
mmap() problem - a tweak or two ain't gonna cut it. 

-bacon


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-30 Thread Bob Friesenhahn

On Tue, 29 May 2012, Iwan Aucamp wrote:

 - Is there a parameter similar to /proc/sys/vm/swappiness that can control 
how long unused pages in the page cache stay in physical RAM if there is no 
shortage of physical RAM? And if not, how long will unused pages in the page 
cache stay in physical RAM given there is no shortage of physical RAM?


Absent pressure for memory, no longer referenced pages will stay in 
memory forever.  They can then be re-referenced in memory.
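One rough way to check whether there is any such memory pressure is to watch the 
page scanner; a sustained nonzero scan rate means the system is actively hunting 
for pages to reclaim. A minimal sketch (standard Solaris commands):

    # sample VM statistics every 5 seconds; a nonzero "sr" (scan rate) column
    # indicates memory pressure
    vmstat 5

    # free memory is also visible as a kstat (value is in pages)
    kstat -p unix:0:system_pages:freemem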


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-29 Thread Iwan Aucamp

On 05/29/2012 03:29 AM, Daniel Carosone wrote:
For the mmap case: does the ARC keep a separate copy, or does the vm 
system map the same page into the process's address space? If a 
separate copy is made, that seems like a potential source of many 
kinds of problems - if it's the same page then the whole premise is 
essentially moot and there's no double caching. 


As far as I understand it, for the mmap case the page cache is 
distinct from the ARC (i.e. the normal simplified flow for reading from disk 
with mmap is DSK -> ARC -> page cache) - and only the page cache gets mapped into 
the process's address space - which is what results in the double caching.
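A rough way to watch the two caches side by side (exact output varies by release; 
this is only an illustration):

    # current ARC size in bytes
    kstat -p zfs:0:arcstats:size

    # kernel memory breakdown; "ZFS File Data" roughly corresponds to the ARC,
    # while mapped file pages are counted in the page-cache related categories
    echo ::memstat | mdb -k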


I have two other general questions regarding the page cache with ZFS + Solaris:
 - Does anything other than mmap still use the page cache?
 - Is there a parameter similar to /proc/sys/vm/swappiness that can 
control how long unused pages in the page cache stay in physical RAM if 
there is no shortage of physical RAM? And if not, how long will unused 
pages in the page cache stay in physical RAM given there is no shortage 
of physical RAM?


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Andrew Gabriel

On 05/28/12 20:06, Iwan Aucamp wrote:
I'm getting sub-optimal performance with an mmap-based database 
(mongodb) which is running on ZFS on Solaris 10u9.


System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB) 
ram (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks


 - a few mongodb instances are running with moderate IO and a total 
rss of 50 GB
 - a service which logs quite excessively (5GB every 20 mins) is also 
running (max 2GB ram use) - log files are compressed after some time 
to bzip2.


Database performance is quite horrid though - it seems that zfs does 
not know how to manage allocation between page cache and arc cache - 
and it seems arc cache wins most of the time.


I'm thinking of doing the following:
 - relocating mmaped (mongo) data to a zfs filesystem with only 
metadata cache

 - reducing zfs arc cache to 16 GB

Are there any other recommendations - and is the above likely to improve 
performance?


1. Upgrade to S10 Update 10 - this has various performance improvements, 
in particular related to database type loads (but I don't know anything 
about mongodb).


2. Reduce the ARC size so RSS + ARC + other memory users < RAM size.
I assume the RSS includes whatever caching the database does. In 
theory, a database should be able to work out what's worth caching 
better than any filesystem can guess from underneath it, so you want to 
configure more memory in the DB's cache than in the ARC. (The default 
ARC tuning is unsuitable for a database server.)
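On Solaris 10 the ARC cap is usually set in /etc/system and takes effect at the 
next boot; a minimal sketch for the 16GB figure mentioned earlier in the thread 
(the value is only an example):

    * /etc/system entry: cap the ARC at 16 GB (0x400000000 bytes)
    set zfs:zfs_arc_max=0x400000000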


3. If the database has some concept of blocksize or recordsize that it 
uses to perform i/o, make sure the filesystems it is using are configured 
to use the same recordsize. The ZFS default recordsize (128kB) is usually 
much bigger than database blocksizes. This is probably going to have 
less impact with an mmaped database than a read(2)/write(2) database, 
where it may prove better to match the filesystem's record size to the 
system's page size (4kB, unless it's using some type of large pages). I 
haven't tried playing with recordsize for memory mapped i/o, so I'm 
speculating here.
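For example, if the database did 8kB i/o (an illustrative figure - check your 
own), and with an illustrative dataset name:

    # set a matching recordsize; note it only applies to newly written files
    zfs set recordsize=8k tank/mongo
    zfs get recordsize tank/mongo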


Blocksize or recordsize may apply to the log file writer too, and it may 
be that this needs a different recordsize and therefore has to be in a 
different filesystem. If it uses write(2) or some variant rather than 
mmap(2) and doesn't document this in detail, DTrace is your friend.
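For instance, a rough one-liner to see whether a given process (the PID shown is 
hypothetical) is issuing write(2) or mmap(2) calls:

    # count write(2) vs mmap(2) calls made by PID 1234 until Ctrl-C
    dtrace -n 'syscall::write:entry,syscall::mmap:entry /pid == $target/ { @[probefunc] = count(); }' -p 1234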


4. Keep plenty of free space in the zpool if you want good database 
performance. If you're more than 60% full (S10U9) or 80% full (S10U10), 
that could be a factor.
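Checking that is quick:

    # compare the CAP column against the ~60%/80% guidance above
    zpool list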


Anyway, there are a few things to think about.

--
Andrew


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Lionel Cons
On Mon, May 28, 2012 at 9:06 PM, Iwan Aucamp aucam...@gmail.com wrote:
 I'm getting sub-optimal performance with an mmap-based database (mongodb)
 which is running on ZFS on Solaris 10u9.

 System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB) ram
 (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks

  - a few mongodb instances are running with moderate IO and a total rss
 of 50 GB
  - a service which logs quite excessively (5GB every 20 mins) is also
 running (max 2GB ram use) - log files are compressed after some time to
 bzip2.

 Database performance is quite horrid though - it seems that zfs does not
 know how to manage allocation between page cache and arc cache - and it
 seems arc cache wins most of the time.

 I'm thinking of doing the following:
  - relocating mmaped (mongo) data to a zfs filesystem with only metadata
 cache
  - reducing zfs arc cache to 16 GB

 Are there any other recommendations - and is the above likely to improve
 performance?

The only recommendation which will lead to results is to use a
different OS or filesystem. Your choices are
- FreeBSD with ZFS
- Linux with BTRFS
- Solaris with QFS
- Solaris with UFS
- Solaris with NFSv4, use ZFS on independent fileserver machines

There's a rather mythical rewrite of the Solaris virtual memory
subsystem called VM2 in progress, but it will still take a long time
until it becomes available to customers, and there is no real data yet
on whether it will help with mmap performance. It won't be available
for OpenSolaris successors like Illumos either (likely never; at least
the Illumos leadership doesn't see the need for it and instead
recommends rewriting applications to not use mmap).

Lionel


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Richard Elling
On May 28, 2012, at 12:46 PM, Lionel Cons wrote:

 On Mon, May 28, 2012 at 9:06 PM, Iwan Aucamp aucam...@gmail.com wrote:
 I'm getting sub-optimal performance with an mmap-based database (mongodb)
 which is running on ZFS on Solaris 10u9.
 
 System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB) ram
 (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks
 
 - a few mongodb instances are running with moderate IO and a total rss
 of 50 GB
 - a service which logs quite excessively (5GB every 20 mins) is also
 running (max 2GB ram use) - log files are compressed after some time to
 bzip2.
 
 Database performance is quite horrid though - it seems that zfs does not
 know how to manage allocation between page cache and arc cache - and it
 seems arc cache wins most of the time.
 
 I'm thinking of doing the following:
 - relocating mmaped (mongo) data to a zfs filesystem with only metadata
 cache
 - reducing zfs arc cache to 16 GB
 
 Are there any other recommendations - and is the above likely to improve
 performance?
 
 The only recommendation which will lead to results is to use a
 different OS or filesystem. Your choices are
 - FreeBSD with ZFS
 - Linux with BTRFS
 - Solaris with QFS
 - Solaris with UFS
 - Solaris with NFSv4, use ZFS on independent fileserver machines
 
 There's a rather mythical rewrite of the Solaris virtual memory
 subsystem called VM2 in progress, but it will still take a long time
 until it becomes available to customers, and there is no real data yet
 on whether it will help with mmap performance. It won't be available
 for OpenSolaris successors like Illumos either (likely never; at least
 the Illumos leadership doesn't see the need for it and instead
 recommends rewriting applications to not use mmap).

This is a mischaracterization of the statements given. The illumos team
says they will not implement Oracle's VM2 for valid, legal reasons. 
That does not mean that mmap performance improvements for ZFS 
cannot be implemented via other methods.

The primary concern for mmap files is that the RAM footprint is doubled.
If you do not manage this via limits, there can be a fight between the 
page cache and ARC over a constrained RAM resource.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Richard Elling
question below...

On May 28, 2012, at 1:25 PM, Iwan Aucamp wrote:

 On 05/28/2012 10:12 PM, Andrew Gabriel wrote:
  On 05/28/12 20:06, Iwan Aucamp wrote:
 I'm thinking of doing the following:
  - relocating mmaped (mongo) data to a zfs filesystem with only
 metadata cache
  - reducing zfs arc cache to 16 GB
 
 Are there any other recommendations - and is the above likely to improve
 performance?
 1. Upgrade to S10 Update 10 - this has various performance improvements,
 in particular related to database type loads (but I don't know anything
 about mongodb).
 
 2. Reduce the ARC size so RSS + ARC + other memory users < RAM size.
 I assume the RSS includes whatever caching the database does. In
 theory, a database should be able to work out what's worth caching
 better than any filesystem can guess from underneath it, so you want to
 configure more memory in the DB's cache than in the ARC. (The default
 ARC tuning is unsuitable for a database server.)
 
 3. If the database has some concept of blocksize or recordsize that it
 uses to perform i/o, make sure the filesystems it is using are configured
 to use the same recordsize. The ZFS default recordsize (128kB) is usually
 much bigger than database blocksizes. This is probably going to have
 less impact with an mmaped database than a read(2)/write(2) database,
 where it may prove better to match the filesystem's record size to the
 system's page size (4kB, unless it's using some type of large pages). I
 haven't tried playing with recordsize for memory mapped i/o, so I'm
 speculating here.
 
 Blocksize or recordsize may apply to the log file writer too, and it may
 be that this needs a different recordsize and therefore has to be in a
 different filesystem. If it uses write(2) or some variant rather than
 mmap(2) and doesn't document this in detail, DTrace is your friend.
 
 4. Keep plenty of free space in the zpool if you want good database
 performance. If you're more than 60% full (S10U9) or 80% full (S10U10),
 that could be a factor.
 
 Anyway, there are a few things to think about.
 
 Thanks for the feedback. I cannot really do 1, but will look into points 3 
 and 4 - in addition to 2, which is what I hope to achieve with my second 
 point. I would still like to know whether it is recommended to do only 
 metadata caching for mmaped files (the mongodb data files) - the way I see it, 
 this should get rid of the double caching which is being done for mmaped 
 files.

I'd be interested in the results of such tests. You can change the primarycache 
parameter on the fly, so you could test it in less time than it takes for me to 
type this email :-)
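For reference, that on-the-fly change would look something like this, assuming 
the mongodb data lives in a dataset called tank/mongo (the name is illustrative):

    # cache only metadata, not file data, in the ARC for this dataset
    zfs set primarycache=metadata tank/mongo
    # verify, and revert if it doesn't help
    zfs get primarycache tank/mongo
    zfs set primarycache=all tank/mongo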
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Lionel Cons
On 28 May 2012 22:10, Richard Elling richard.ell...@gmail.com wrote:
 The only recommendation which will lead to results is to use a
 different OS or filesystem. Your choices are
 - FreeBSD with ZFS
 - Linux with BTRFS
 - Solaris with QFS
 - Solaris with UFS
 - Solaris with NFSv4, use ZFS on independent fileserver machines

 There's a rather mythical rewrite of the Solaris virtual memory
 subsystem called VM2 in progress, but it will still take a long time
 until it becomes available to customers, and there is no real data yet
 on whether it will help with mmap performance. It won't be available
 for OpenSolaris successors like Illumos either (likely never; at least
 the Illumos leadership doesn't see the need for it and instead
 recommends rewriting applications to not use mmap).


 This is a mischaracterization of the statements given. The illumos team
 says they will not implement Oracle's VM2 for valid, legal reasons.
 That does not mean that mmap performance improvements for ZFS
 cannot be implemented via other methods.

I'd like to hear what the other methods should be. The lack of mmap
performance is only a symptom of a more severe disease. Just doing
piecework and altering the VFS API to integrate ZFS/ARC/VM with each
other doesn't fix the underlying problems.

I've assigned two of my staff, one familiar with the FreeBSD VM and
one familiar with the Linux VM, to look at the current VM subsystem,
and their preliminary reports point to disaster. If Illumos does not
initiate a VM rewrite project of its own that will make the VM aware
of NUMA, power management and other issues, then I predict nothing less
than the downfall of Illumos within a couple of years, because the
performance impact is dramatic and makes the Illumos kernel no longer
competitive.
Despite these findings, of which Sun was aware for a long time, and
the number of ex-Sun employees working on Illumos, I don't see the
commitment to launch such a project. That's why I said "likely never",
unless of course someone slams Garrett's head with sufficient force on
a wooden table to make him see the reality.

The reality is:
- The modern x86 server platforms are now all NUMA or NUMA-like. Lack
of NUMA support leads to bad performance.
- They all use some kind of serialized link between CPU nodes, be it
HyperTransport or QuickPath, with power management. If power
management is active and has reduced the number of active links
between nodes and the OS doesn't manage this correctly, you'll get bad
performance. Illumos's VM isn't even remotely aware of this fact.
- Based on simulator testing we see that in a simulated environment
with 8 sockets almost 40% of kernel memory accesses are _REMOTE_
accesses, i.e. they are not local to the node accessing them.
Those are all preliminary results; I expect that the remainder of the
analysis will take another 4-5 weeks until we present the findings to
the Illumos community. But I can say already it will be a slap in the face
for those who think that Illumos doesn't need a better VM system.
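For what it's worth, the locality-group view of a machine can at least be
inspected from userland on Solaris/illumos (the PID below is hypothetical):

    # show the lgroup (NUMA node) hierarchy, CPUs and memory per lgroup
    lgrpinfo -a

    # show which lgroup backs each mapping of a given process
    pmap -L 1234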

 The primary concern for mmap files is that the RAM footprint is doubled.

It's not only that RAM use is doubled; the data are copied between the
ARC and the page cache multiple times. You can argue that memory and
in-memory copy operations are cheap, but this, combined with the lack of
NUMA awareness, is a real performance killer.

Lionel


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Richard Elling
[Apologies to the list, this has expanded past ZFS, if someone complains, we can
move the thread to another illumos dev list]

On May 28, 2012, at 2:18 PM, Lionel Cons wrote:

 On 28 May 2012 22:10, Richard Elling richard.ell...@gmail.com wrote:
 The only recommendation which will lead to results is to use a
 different OS or filesystem. Your choices are
 - FreeBSD with ZFS
 - Linux with BTRFS
 - Solaris with QFS
 - Solaris with UFS
 - Solaris with NFSv4, use ZFS on independent fileserver machines
 
 There's a rather mythical rewrite of the Solaris virtual memory
 subsystem called VM2 in progress, but it will still take a long time
 until it becomes available to customers, and there is no real data yet
 on whether it will help with mmap performance. It won't be available
 for OpenSolaris successors like Illumos either (likely never; at least
 the Illumos leadership doesn't see the need for it and instead
 recommends rewriting applications to not use mmap).
 
 
 This is a mischaracterization of the statements given. The illumos team
 says they will not implement Oracle's VM2 for valid, legal reasons.
 That does not mean that mmap performance improvements for ZFS
 cannot be implemented via other methods.
 
 I'd like to hear what the other methods should be. The lack of mmap
 performance is only a symptom of a more severe disease. Just doing
 piecework and altering the VFS API to integrate ZFS/ARC/VM with each
 other doesn't fix the underlying problems.
 
 I've assigned two of my staff, one familiar with the FreeBSD VM and
 one familiar with the Linux VM, to look at the current VM subsystem,
 and their preliminary reports point to disaster. If Illumos does not
 initiate a VM rewrite project of its own that will make the VM aware
 of NUMA, power management and other issues, then I predict nothing less
 than the downfall of Illumos within a couple of years, because the
 performance impact is dramatic and makes the Illumos kernel no longer
 competitive.
 Despite these findings, of which Sun was aware for a long time, and
 the number of ex-Sun employees working on Illumos, I don't see the
 commitment to launch such a project. That's why I said "likely never",
 unless of course someone slams Garrett's head with sufficient force on
 a wooden table to make him see the reality.
 
 The reality is:
 - The modern x86 server platforms are now all NUMA or NUMA-like. Lack
 of NUMA support leads to bad performance

SPARC has been NUMA since 1997 and Solaris changed the scheduler
long ago.

 - They all use some kind of serialized link between CPU nodes, be it
 HyperTransport or QuickPath, with power management. If power
 management is active and has reduced the number of active links
 between nodes and the OS doesn't manage this correctly, you'll get bad
 performance. Illumos's VM isn't even remotely aware of this fact.
 - Based on simulator testing we see that in a simulated environment
 with 8 sockets almost 40% of kernel memory accesses are _REMOTE_
 accesses, i.e. they are not local to the node accessing them.
 Those are all preliminary results; I expect that the remainder of the
 analysis will take another 4-5 weeks until we present the findings to
 the Illumos community. But I can say already it will be a slap in the
 face for those who think that Illumos doesn't need a better VM system.

Nobody said illumos doesn't need a better VM system. The statement was that 
illumos is not going to reverse-engineer Oracle's VM2.

 The primary concern for mmap files is that the RAM footprint is doubled.
 
 It's not only that RAM use is doubled; the data are copied between the
 ARC and the page cache multiple times. You can argue that memory and
 in-memory copy operations are cheap, but this, combined with the lack of
 NUMA awareness, is a real performance killer.

Anybody who has worked on a SPARC system for the past 15 years is well
aware of NUMAness. We've been living in a NUMA world for a very long time -
a world where the processors were slow and far-memory latency was much, much
worse than what we see in the x86 world.

I look forward to seeing the results of your analysis and experiments.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Jim Klimov

2012-05-29 0:34, Richard Elling wrote:

I'd be interested in the results of such tests. You can change the 
primarycache parameter on the fly, so you could test it in less time than 
it takes for me to type this email :-)



I believe it would also take some time for memory distribution
to settle, expiring ARC data pages and actually claiming the
RAM for the application... Right? ;)

//Jim


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Daniel Carosone
On Mon, May 28, 2012 at 01:34:18PM -0700, Richard Elling wrote:
 I'd be interested in the results of such tests. 

Me too, especially for databases like postgresql, where there's a
complementary cache size tunable within the db that often needs to be
turned up, since they implicitly rely on some filesystem caching as an L2. 

That's where this gets tricky: L2ARC has the opportunity to make a big
difference, where the entire db won't all fit in memory (regardless of
which subsystem has jurisdiction over that memory).  If you exclude
data from ARC, you can't spill it to L2ARC.
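For context, the per-dataset knobs involved are primarycache and secondarycache, 
and L2ARC devices are added per pool - a sketch with illustrative names:

    # add an SSD as an L2ARC cache device to the pool
    zpool add tank cache c0t5d0

    # see what is eligible for the ARC and for L2ARC on a given dataset;
    # data kept out of the ARC (primarycache=metadata) never gets evicted
    # into L2ARC, which is the point made above
    zfs get primarycache,secondarycache tank/db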

For the mmap case: does the ARC keep a separate copy, or does the vm
system map the same page into the process's address space?  If a
separate copy is made, that seems like a potential source of many
kinds of problems - if it's the same page then the whole premise is
essentially moot and there's no double caching.

--
Dan.
