[Apologies to the list: this has expanded past ZFS. If someone complains, we can
move the thread to another illumos dev list.]

On May 28, 2012, at 2:18 PM, Lionel Cons wrote:

> On 28 May 2012 22:10, Richard Elling <richard.ell...@gmail.com> wrote:
>> The only recommendation which will lead to results is to use a
>> different OS or filesystem. Your choices are
>> - FreeBSD with ZFS
>> - Linux with BTRFS
>> - Solaris with QFS
>> - Solaris with UFS
>> - Solaris with NFSv4, use ZFS on independent fileserver machines
>> There's a rather mythical rewrite of the Solaris virtual memory
>> subsystem called VM2 in progress but it will still take a long time
>> until this will become available for customers and there are no real
>> data yet whether this will help with mmap performance. It won't be
>> available for OpenSolaris successors like Illumos either
>> (likely never, at least the Illumos leadership doesn't see the need
>> for this and instead recommends rewriting the applications to not use
>> mmap).
>> This is a mischaracterization of the statements given. The illumos team
>> says they will not implement Oracle's VM2 for valid, legal reasons.
>> That does not mean that mmap performance improvements for ZFS
>> cannot be implemented via other methods.
> I'd like to hear what the other methods should be. The lack of mmap
> performance is only a symptom of a more severe disease. Just doing
> piecework and altering the VFS API to integrate ZFS/ARC/VM with each
> other doesn't fix the underlying problems.
> I've assigned two of my staff, one familiar with the FreeBSD VM and
> one familiar with the Linux VM, to look at the current VM subsystem
> and their preliminary reports point to disaster. If Illumos does not
> initiate a VM rewrite project of its own which will make the VM aware
> of NUMA, power management and other issues then I predict nothing less
> than the downfall of Illumos within a couple of years because the
> performance impact is dramatic and makes the Illumos kernel no longer
> competitive.
> Despite these findings, of which Sun was aware for a long time, and
> the number of ex-Sun employees working on Illumos, I miss the
> commitment to launch such a project. That's why I said "likely never",
> unless of course someone slams Garrett's head with sufficient force on
> a wooden table to make him see the reality.
> The reality is:
> - The modern x86 server platforms are now all NUMA or NUMA-like. Lack
> of NUMA support leads to bad performance

SPARC has been NUMA since 1997 and Solaris changed the scheduler
long ago.

> - They all use some kind of serialized link between CPU nodes, be it
> HyperTransport or QuickPath, with power management. If power
> management is active and has reduced the number of active links
> between nodes and the OS doesn't manage this correctly you'll get bad
> performance. Illumos's VM isn't even remotely aware of this fact
> - Based on simulator testing we see that in a simulated environment
> with 8 sockets almost 40% of kernel memory accesses are _REMOTE_
> accesses, i.e. the memory is not local to the node accessing it
> These are all preliminary results, I expect that the remainder of the
> analysis will take another 4-5 weeks until we present the findings to
> the Illumos community. But I can say already it will be a faceslap for
> those who think that Illumos doesn't need a better VM system.

Nobody said illumos doesn't need a better VM system. The statement was that 
illumos is not going to reverse-engineer Oracle's VM2.

>> The primary concern for mmap files is that the RAM footprint is doubled.
> It's not only that RAM is doubled, the data are copied between both
> ARC and page cache multiple times. You can say memory and the in
> memory copy operation are cheap, but this and the lack of NUMA
> awareness is a real performance killer.

Anybody who has worked on a SPARC system for the past 15 years is well
aware of NUMAness. We've been living in a NUMA world for a very long time,
a world where the processors were slow and far-memory latency was much, much
worse than we see in the x86 world.

I look forward to seeing the results of your analysis and experiments.
 -- richard

ZFS Performance and Training

zfs-discuss mailing list
