Re: [webkit-dev] WTF::fastMalloc

2013-10-05 Thread Maciej Stachowiak

These days, pure JavaScript tests won't do a lot of malloc() calls, so it's 
more relevant to try a page load speed or DOM benchmark.

 - Maciej

On Oct 4, 2013, at 6:10 AM, Osztrogonác Csaba o...@inf.u-szeged.hu wrote:

 Hi,
 
 As Zoltan said, this feature was introduced for the Qt port. But now
 EFL, GTK and Nix use fastmalloc instead of the system malloc too.
 It worked well and served some use cases back then.
 
 To decide whether fastmalloc or the system malloc is better,
 we need some measurements. I made a quick test on EFL and Nix with
 SunSpider and with the Methanol test suite and haven't seen any
 significant performance differences between fastmalloc and system
 malloc on my desktop: Ubuntu 12.04 (x86_64). I haven't checked the
 memory consumption; that would need more preparation.
 
 Keeping the old TCMalloc and the custom allocator framework isn't
 a blocker for us (University of Szeged), so we have no objection
 to removing it from trunk. If nobody is interested in maintaining
 the framework, it can be removed. If the final conclusion is to
 drop TCMalloc, we will gladly help with the clean-up.
 
 Ossy
 
 Zoltan Horvath wrote:
 I used to work on memory-related topics while I was working at the 
 University of Szeged.
 Based on a 2.5-year-old measurement 
 (http://webkit.sed.hu/blog/20100302/war-allocators-qtlaunchers-coast) on the 
 Qt port, page loading on the Methanol test suite was 5% faster (avg) 
 with TCMalloc than with the default system allocator on Linux. The performance 
 results of the SunSpider suite were similar for both allocators. The memory 
 consumption was always lower with the default OS allocator. I guess the new 
 allocator only has iOS support. I'm fine with removing TCMalloc, although 
 this direction might raise further questions, like removing the custom 
 allocation framework as well. Feel free to cc me on bugs; I can help by 
 contributing some patches. 
 
 On Mon, Sep 30, 2013 at 2:48 PM, Geoffrey Garen gga...@apple.com wrote:
 I'm planning to remove our years-out-of-date port of TCMalloc, and
 replace it with something that takes maximum advantage of Mac and
 iOS virtual memory, threading, and security APIs.
 I've heard that TCMalloc has caused some problems for non-Mac,
 non-iOS ports in the past. So, if you maintain a port, this change
 might make things simpler for you.
 Are there any ports whose built-in malloc implementations are slow
 enough that they can't get by without TCMalloc?

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev


Re: [webkit-dev] WTF::fastMalloc

2013-10-04 Thread Osztrogonác Csaba

Hi,

As Zoltan said, this feature was introduced for the Qt port. But now
EFL, GTK and Nix use fastmalloc instead of the system malloc too.
It worked well and served some use cases back then.

To decide whether fastmalloc or the system malloc is better,
we need some measurements. I made a quick test on EFL and Nix with
SunSpider and with the Methanol test suite and haven't seen any
significant performance differences between fastmalloc and system
malloc on my desktop: Ubuntu 12.04 (x86_64). I haven't checked the
memory consumption; that would need more preparation.

Keeping the old TCMalloc and the custom allocator framework isn't
a blocker for us (University of Szeged), so we have no objection
to removing it from trunk. If nobody is interested in maintaining
the framework, it can be removed. If the final conclusion is to
drop TCMalloc, we will gladly help with the clean-up.

Ossy

Zoltan Horvath wrote:
I used to work on memory-related topics while I was working at the 
University of Szeged.

Based on a 2.5-year-old measurement 
(http://webkit.sed.hu/blog/20100302/war-allocators-qtlaunchers-coast) on 
the Qt port, page loading on the Methanol test suite was 5% faster 
(avg) with TCMalloc than with the default system allocator on Linux. The 
performance results of the SunSpider suite were similar for both 
allocators. The memory consumption was always lower with the default OS 
allocator. 

I guess the new allocator only has iOS support. I'm fine with removing 
TCMalloc, although this direction might raise further questions, like 
removing the custom allocation framework as well. Feel free to cc me on 
bugs; I can help by contributing some patches. 


On Mon, Sep 30, 2013 at 2:48 PM, Geoffrey Garen gga...@apple.com wrote:
I'm planning to remove our years-out-of-date port of TCMalloc, and
replace it with something that takes maximum advantage of Mac and
iOS virtual memory, threading, and security APIs.

I've heard that TCMalloc has caused some problems for non-Mac,
non-iOS ports in the past. So, if you maintain a port, this change
might make things simpler for you.

Are there any ports whose built-in malloc implementations are slow
enough that they can't get by without TCMalloc?



Re: [webkit-dev] WTF::fastMalloc

2013-10-02 Thread Konstantin Tokarev

02.10.2013, 03:18, Zoltan Horvath zol...@webkit.org:
 On Tue, Oct 1, 2013 at 3:52 PM, Geoffrey Garen gga...@apple.com wrote:
 So are you proposing to use the system allocator on Windows?

 I’m proposing a two step process:

 (1) Use the system allocator on Windows (and GTK).
 (2) If a port maintainer cares to optimize a given port, without too much 
 disruption to mainline code, they may do so.

 FWIW, If I were conducting (2) for Windows, malloc would be pretty far down 
 the list of things I started porting.

 The current malloc logic has been the source of a number of mysterious 
 crashes on Windows, so reverting to the system allocator might be a good 
 thing for stability. I don’t know what the potential performance 
 ramifications would be.

 Yes, I’ve heard that on other platforms as well.

 This usually happens because of allocation/free mismatches (for example, 
 memory allocated by TCMalloc through the FastMalloc interface (fastMalloc, 
 fastNewMalloc) that something then tries to free with the system free).

Out of curiosity, what's wrong with linking the whole application that uses WebKit 
against tcmalloc or some other malloc implementation? This way it's possible to 
use an optimized allocator without any source changes, and a malloc/free mismatch 
cannot happen. Why was the FastMalloc API needed at all?
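
As an illustration of the allocation/free mismatch discussed above, here is a minimal sketch of one way to keep each block paired with the routine that must free it. The names fastMalloc, fastFree, FastFreeDeleter and makeFastBuffer are hypothetical stand-ins that simply wrap std::malloc and std::free and count live blocks; they are not WebKit's actual FastMalloc API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

namespace sketch {

// Hypothetical stand-ins for a custom allocator's entry points.
// They wrap std::malloc/std::free and count the blocks still alive.
static int liveBlocks = 0;

void* fastMalloc(std::size_t size)
{
    void* block = std::malloc(size);
    if (block)
        ++liveBlocks;
    return block;
}

void fastFree(void* block)
{
    if (!block)
        return;
    --liveBlocks;
    std::free(block);
}

// The deleter records, in the type, which routine must release the block,
// so the block cannot be handed to a different allocator's free by accident.
struct FastFreeDeleter {
    void operator()(void* block) const { fastFree(block); }
};

using FastBuffer = std::unique_ptr<void, FastFreeDeleter>;

FastBuffer makeFastBuffer(std::size_t size)
{
    return FastBuffer(fastMalloc(size));
}

} // namespace sketch
```

A caller that holds a sketch::FastBuffer never calls a free function directly, so the pairing is enforced by the type rather than by convention. This only helps inside one code base; it does not address memory that crosses a library boundary as a raw pointer.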

-- 
Regards,
Konstantin


Re: [webkit-dev] WTF::fastMalloc

2013-10-02 Thread Darin Adler
On Oct 2, 2013, at 1:17 AM, Konstantin Tokarev annu...@yandex.ru wrote:

 Out of curiosity, what's wrong with linking the whole application that uses WebKit 
 against tcmalloc or some other malloc implementation?

There are a lot of things wrong with that. Most of them depend on the platform.

On Mac, for example, WebKit is a framework. Linking apps using WebKit against a 
different malloc implementation would have no effect on WebKit’s memory 
allocation. Further, doing this would create allocator mismatch problems for 
any memory allocated by WebKit but freed by the application or vice versa. 
There are many other problems with this approach on Mac. Another one is that 
there are at least thousands of apps currently using WebKit on Mac, maybe tens 
of thousands (hundreds of thousands, at least, on iOS), and so if this is 
something the app developer has to do, there are a lot of people to reach.

-- Darin


Re: [webkit-dev] WTF::fastMalloc

2013-10-02 Thread Maciej Stachowiak

On Oct 2, 2013, at 1:17 AM, Konstantin Tokarev annu...@yandex.ru wrote:

 
 02.10.2013, 03:18, Zoltan Horvath zol...@webkit.org:
 On Tue, Oct 1, 2013 at 3:52 PM, Geoffrey Garen gga...@apple.com wrote:
 So are you proposing to use the system allocator on Windows?
 
 I’m proposing a two step process:
 
 (1) Use the system allocator on Windows (and GTK).
 (2) If a port maintainer cares to optimize a given port, without too much 
 disruption to mainline code, they may do so.
 
 FWIW, If I were conducting (2) for Windows, malloc would be pretty far down 
 the list of things I started porting.
 
 The current malloc logic has been the source of a number of mysterious 
 crashes on Windows, so reverting to the system allocator might be a good 
 thing for stability. I don’t know what the potential performance 
 ramifications would be.
 
 Yes, I’ve heard that on other platforms as well.
 
 This usually happens because of allocation/free mismatches (for example, 
 memory allocated by TCMalloc through the FastMalloc interface (fastMalloc, 
 fastNewMalloc) that something then tries to free with the system free).
 
 Out of curiosity, what's wrong with linking the whole application that uses WebKit 
 against tcmalloc or some other malloc implementation? This way it's possible 
 to use an optimized allocator without any source changes, and a malloc/free 
 mismatch cannot happen. Why was the FastMalloc API needed at all?

We couldn't find a clean way to do this on Mac because some low-level 
frameworks make use of specific obscure features of the system allocator. But 
it may be viable on other platforms.

Regards,
Maciej



Re: [webkit-dev] WTF::fastMalloc

2013-10-02 Thread Maciej Stachowiak

On Oct 2, 2013, at 2:41 AM, Andy Wingo wi...@igalia.com wrote:

 
 We need somebody to resolve these issues, otherwise our memory
 footprint will be unacceptably high, and/or our VM operations will be
 unacceptably slow.
 
 There is no memory footprint problem caused by mmap here -- to my
 knowledge.  I don't know how to profile the VM overhead, though.

It's easy to fix the VM overhead by unmapping the extra at either end, if 
running out of address space is a real risk.

 - Maciej



Re: [webkit-dev] WTF::fastMalloc

2013-10-02 Thread Geoffrey Garen
 However, given the constraints, what's the problem with the mmap
 strategy?  Sure, you have more page tables on the kernel side, but
 mmap'd memory that is never touched is never resident in a process.  I
 verified this a few months back when troubleshooting some memory-related
 issues.

Okeedokee. Can you write up this patch for OSAllocator?

 
 (2) POSIX uses MADV_FREE, MADV_DONTNEED, and/or MADV_WILLNEED. I don’t
 think anybody has ever verified that these APIs do what we want. In my
 experience, they usually don’t. So, we need to find a variation on these
 APIs that works and is fast.
 
 I've looked into it.  The MADV_WILLNEED is useless -- it does nothing on
 anonymous pages, returns -EINVAL, but is harmless also.  The
 MADV_DONTNEED dance does work though, properly paging out memory and
 lazily providing fresh zeroed pages should the memory be paged in again.

The API we want shouldn’t zero the pages or require a page fault right away. It 
should only zero the pages if they end up being used by the rest of the system. 
In the normal case, it should return the pages to us intact. Otherwise, it 
will be too slow, and we’ll have to jump through hoops to avoid using the API 
very much, which confuses the design.
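
The MADV_DONTNEED behaviour described in the quoted text (fresh zeroed pages when the memory is touched again) can be observed with a small sketch. This assumes Linux and a private anonymous mapping; other POSIX systems may treat MADV_DONTNEED differently. The function name decommitZeroesPage is hypothetical.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstring>

namespace sketch {

// Returns true if a dirty private anonymous page reads back as zero after
// madvise(MADV_DONTNEED). Assumption: Linux semantics for this advice value.
bool decommitZeroesPage()
{
    const std::size_t pageSize = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* memory = mmap(nullptr, pageSize, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (memory == MAP_FAILED)
        return false;

    volatile unsigned char* bytes = static_cast<unsigned char*>(memory);
    std::memset(memory, 0xAB, pageSize); // Dirty the page so it is resident.

    // Drop the physical page but keep the virtual address range mapped.
    bool ok = madvise(memory, pageSize, MADV_DONTNEED) == 0;

    // The next access page-faults and is served by a zero-filled page.
    ok = ok && bytes[0] == 0 && bytes[pageSize - 1] == 0;

    munmap(memory, pageSize);
    return ok;
}

} // namespace sketch
```

This is exactly the cost being objected to above: the old contents are gone immediately and every reuse takes a page fault, whether or not the system was short of memory.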

Geoff


Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Martin Robinson
On Mon, Sep 30, 2013 at 7:41 PM, Zoltan Horvath zol...@webkit.org wrote:

 Based on a 2.5-year-old measurement
 (http://webkit.sed.hu/blog/20100302/war-allocators-qtlaunchers-coast) on the
 Qt port, page loading on the Methanol test suite was 5% faster (avg)
 with TCMalloc than with the default system allocator on Linux. The performance
 results of the SunSpider suite were similar for both allocators. The memory
 consumption was always lower with the default OS allocator.

A 5% regression in page load performance seems pretty serious.

--Martin


Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Geoffrey Garen
 A 5% regression in page load performance seems pretty serious.

I’m assuming you’re considering the GTK port here, and not the end-of-life Qt 
port.

Are you up for some engineering work to adopt a better malloc for GTK?

Here’s a rough task list:

(1) Define a canonical GTK platform we’ll use for performance measurement.

(2) Measure FastMalloc on/off on that platform.

Assuming FastMalloc is a significant improvement:

(1) Refactor GTK APIs so that API-level objects are not allocated/deleted by 
global operator new/delete in WebCore+JavaScriptCore.

(1a) Either build the API layer as a separate library from 
WebCore+JavaScriptCore,

(1b) or specifically annotate each object at the API library with a 
per-class operator new / operator delete.

(2) Find a fast secure random number API on the canonical GTK platform.

(3) Find a fast thread-specific data API on the canonical GTK platform.

If you take on these tasks, I’m happy to take on the larger task of providing a 
fast malloc for GTK WebKit.
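
Option (1b) in the list above can be sketched as follows. ApiObject is a hypothetical API-level class, and the per-class operators simply forward to std::malloc and std::free; the counter exists only to make the routing visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

namespace sketch {

static int liveApiObjects = 0;

// Hypothetical API-level class. Its per-class operator new/delete take
// precedence over any replaced global operator new/delete, so objects of
// this type always come from, and return to, the allocator chosen here.
class ApiObject {
public:
    static void* operator new(std::size_t size)
    {
        void* memory = std::malloc(size);
        if (!memory)
            throw std::bad_alloc();
        ++liveApiObjects;
        return memory;
    }

    static void operator delete(void* memory) noexcept
    {
        if (!memory)
            return;
        --liveApiObjects;
        std::free(memory);
    }

    int value = 0;
};

} // namespace sketch
```

With this annotation, `new sketch::ApiObject` and the matching `delete` stay on the system allocator even if the rest of the library replaces the global operators.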

Geoff


Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Geoffrey Garen
 Here’s a rough task list:
 
 (1) Define a canonical GTK platform we’ll use for performance measurement.
 
 (2) Measure FastMalloc on/off on that platform.
 
 Assuming FastMalloc is a significant improvement:
 
 (1) Refactor GTK APIs so that API-level objects are not allocated/deleted by 
 global operator new/delete in WebCore+JavaScriptCore.
 
   (1a) Either build the API layer as a separate library from 
 WebCore+JavaScriptCore,
 
   (1b) or specifically annotate each object at the API library with a 
 per-class operator new / operator delete.
 
 (2) Find a fast secure random number API on the canonical GTK platform.
 
 (3) Find a fast thread-specific data API on the canonical GTK platform.

(4) Find a fast API for aligned virtual memory allocation.

(5) Find a fast API for committing / decommitting physical memory without 
releasing virtual memory pages.

Geoff


Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Oliver Hunt

On Oct 1, 2013, at 11:56 AM, Martin Robinson mrobin...@webkit.org wrote:

 On Tue, Oct 1, 2013 at 11:33 AM, Geoffrey Garen gga...@apple.com wrote:
 A 5% regression in page load performance seems pretty serious.
 
 I’m assuming you’re considering the GTK port here, and not the end-of-life 
 Qt port.
 
 Are you up for some engineering work to adopt a better malloc for GTK?
 
 I appreciate your offer!
 
 Here’s a rough task list:
 
 (1) Define a canonical GTK platform we’ll use for performance measurement.
 
 Perhaps the University of Szeged team has some insight into what
 platforms they used for comparing allocator performance.
 
 (1) Refactor GTK APIs so that API-level objects are not allocated/deleted by 
 global operator new/delete in WebCore+JavaScriptCore.
(1a) Either build the API layer as a separate library from 
 WebCore+JavaScriptCore,
(1b) or specifically annotate each object at the API library with a 
 per-class operator new / operator delete.
 
 I don't think this should be a problem. Currently all allocations of
 API-level objects happen with the GLib slab allocator (or system
 malloc/free, given the right environment arguments).
 
 (2) Find a fast secure random number API on the canonical GTK platform.
 
 I can look into this.

WTF has a custom implementation of arc4random(). I suspect most current GTK 
host environments have a native one as well (rand_s on Windows is terribly 
slow, but like I said, WTF has its own secure generator that will seed 
appropriately).

 
 (3) Find a fast thread-specific data API on the canonical GTK platform.
 
 Threading for GTK+ on non-Mac/non-Windows platforms is essentially
 pthreads. It probably wouldn't be a lot of work to defer to Windows
 and Mac implementations on those platforms.

I recall Linux having fast thread locals, as does Windows.

--Oliver



Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Maciej Stachowiak

On Sep 30, 2013, at 2:48 PM, Geoffrey Garen gga...@apple.com wrote:

 Hi folks.
 
 I’m planning to remove our years-out-of-date port of TCMalloc, and replace it 
 with something that takes maximum advantage of Mac and iOS virtual memory, 
 threading, and security APIs.
 
 I've heard that TCMalloc has caused some problems for non-Mac, non-iOS ports 
 in the past. So, if you maintain a port, this change might make things 
 simpler for you.
 
 Are there any ports whose built-in malloc implementations are slow enough 
 that they can’t get by without TCMalloc?

Apple's Windows port uses FastMalloc and the last measurements we took show it 
to be a large performance gain over the default Windows malloc implementation. 
While this port is only used by iTunes these days, we still would not want to 
regress its performance. Can the new allocator be made to work with Windows?

Regards,
Maciej



Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Martin Robinson
On Tue, Oct 1, 2013 at 11:34 AM, Geoffrey Garen gga...@apple.com wrote:

 (4) Find a fast API for aligned virtual memory allocation.
 (5) Find a fast API for committing / decommitting physical memory without 
 releasing virtual memory pages.

Hrm. Isn't this already available via OSAllocator or are you referring
to the fact that the Posix implementation has a few problems?

--Martin


Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Zoltan Horvath
On Tue, Oct 1, 2013 at 11:56 AM, Martin Robinson mrobin...@webkit.org wrote:


  Here’s a rough task list:
 
  (1) Define a canonical GTK platform we’ll use for performance
 measurement.

 Perhaps the University of Szeged team has some insight into what
 platforms they used for comparing allocator performance.


I measured the performance and memory for Qt on desktop and on some ARM-based
embedded devices (e.g. Nokia N9). The posts are still available on
the blog site, but I'm not sure we can consider the numbers valid after
that many years. Please also note that I've been working for Adobe for more
than a year now, so I don't know whether the University team has any recent
public results.

The goal of enabling TCMalloc on Qt/GTK was to match the implementation
of the Apple port, which used TCMalloc at the time. Please also note that only
a subset of QtWebKit platforms use TCMalloc (Linux, Mac); the rest
still use the default system allocator.

 (1) Refactor GTK APIs so that API-level objects are not allocated/deleted
 by global operator new/delete in WebCore+JavaScriptCore.
  (1a) Either build the API layer as a separate library from
 WebCore+JavaScriptCore,
  (1b) or specifically annotate each object at the API library
 with a per-class operator new / operator delete.

 I don't think this should be a problem. Currently all allocations of
 API-level objects happen with the GLib slab allocator (or system
 malloc/free, given the right environment arguments).

  (2) Find a fast secure random number API on the canonical GTK platform.

 I can look into this.

  (3) Find a fast thread-specific data API on the canonical GTK platform.

 Threading for GTK+ on non-Mac/non-Windows platforms is essentially
 pthreads. It probably wouldn't be a lot of work to defer to Windows
 and Mac implementations on those platforms.

 --Martin



Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Geoffrey Garen
 (4) Find a fast API for aligned virtual memory allocation.
 (5) Find a fast API for committing / decommitting physical memory without 
 releasing virtual memory pages.
 
 Hrm. Isn't this already available via OSAllocator or are you referring
 to the fact that the Posix implementation has a few problems?

OSAllocator is the right model, yes. 

There are two problems with the current OSAllocator POSIX implementation:

(1) It uses mmap, which doesn’t support aligned allocation. To get aligned 
allocation, POSIX double-allocates all virtual memory. That is 2X too much. So, 
we need to find a variation on mmap that supports an alignment constraint.

(2) POSIX uses MADV_FREE, MADV_DONTNEED, and/or MADV_WILLNEED. I don’t think 
anybody has ever verified that these APIs do what we want. In my experience, 
they usually don’t. So, we need to find a variation on these APIs that works 
and is fast.

We need somebody to resolve these issues, otherwise our memory footprint will 
be unacceptably high, and/or our VM operations will be unacceptably slow.

Geoff


Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Geoffrey Garen
 Apple's Windows port uses FastMalloc and the last measurements we took show 
 it to be a large performance gain over the default Windows malloc 
 implementation.

I believe those measurements were taken 5 Windows versions ago.

 While this port is only used by iTunes these days, we still would not want to 
 regress its performance. Can the new allocator be made to work with Windows?

The set of porting tasks is the same set I outlined for GTK.

The Windows port is missing many performance features, including tiled 
scrolling, LLInt, parallel garbage collection, DFG, and FTL. Given those other 
major missing pieces, I don’t think this piece is worth the porting time.

Geoff


Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Brent Fulgham
So are you proposing to use the system allocator on Windows? Or would we keep 
using the existing FastMalloc implementation?

The current malloc logic has been the source of a number of mysterious crashes 
on Windows, so reverting to the system allocator might be a good thing for 
stability. I don’t know what the potential performance ramifications would be.

-Brent

On Oct 1, 2013, at 3:23 PM, Geoffrey Garen gga...@apple.com wrote:

 Apple's Windows port uses FastMalloc and the last measurements we took show 
 it to be a large performance gain over the default Windows malloc 
 implementation.
 
 I believe those measurements were taken 5 Windows versions ago.
 
 While this port is only used by iTunes these days, we still would not want 
 to regress its performance. Can the new allocator be made to work with 
 Windows?
 
 The set of porting tasks is the same set I outlined for GTK.
 
 The Windows port is missing many performance features, including tiled 
 scrolling, LLInt, parallel garbage collection, DFG, and FTL. Given those 
 other major missing pieces, I don’t think this piece is worth the porting 
 time.
 
 Geoff


Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Maciej Stachowiak

On Oct 1, 2013, at 3:35 PM, Brent Fulgham bfulg...@apple.com wrote:

 So are you proposing to use the system allocator on Windows? Or would we keep 
 using the existing FastMalloc implementation?
 
 The current malloc logic has been the source of a number of mysterious 
 crashes on Windows, so reverting to the system allocator might be a good 
 thing for stability. I don’t know what the potential performance 
 ramifications would be.

They would be bad. The default malloc on Windows is very slow.

 - Maciej

 
 -Brent
 
 On Oct 1, 2013, at 3:23 PM, Geoffrey Garen gga...@apple.com wrote:
 
 Apple's Windows port uses FastMalloc and the last measurements we took show 
 it to be a large performance gain over the default Windows malloc 
 implementation.
 
 I believe those measurements were taken 5 Windows versions ago.
 
 While this port is only used by iTunes these days, we still would not want 
 to regress its performance. Can the new allocator be made to work with 
 Windows?
 
 The set of porting tasks is the same set I outlined for GTK.
 
 The Windows port is missing many performance features, including tiled 
 scrolling, LLInt, parallel garbage collection, DFG, and FTL. Given those 
 other major missing pieces, I don’t think this piece is worth the porting 
 time.
 
 Geoff


Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Maciej Stachowiak

On Oct 1, 2013, at 3:05 PM, Geoffrey Garen gga...@apple.com wrote:

 (3) Find a fast thread-specific data API on the canonical GTK platform.
 
 Threading for GTK+ on non-Mac/non-Windows platforms is essentially
 pthreads.
 
 To access thread-specific data using pthreads, you first need to take a lock 
 and call pthread_key_create(). Since the whole point of thread-specific data 
 is to avoid taking a lock, the API is useless.

The normal way to do it is to use pthread_once to create the key, which does 
not in general take a lock. (That or use an out-of-band prior initializer, but 
that wouldn't work for malloc).

 
 You’ll need an alternative to the cross-platform pthread API for accessing 
 thread-specific data. Otherwise, the cost of that API will dominate any other 
 cost, and it won’t be worth our time to try to optimize other things.

FastMalloc uses vanilla pthread_getspecific() all the time (including at least 
on every malloc call) on platforms that don't have a faster form of 
thread-specific data (such as pthread_getspecific_direct on Mac or __thread on 
Windows). While it makes a difference, FastMalloc still tends to be faster 
overall than system malloc implementations. So I suspect it would work ok for 
the new malloc as well. Probably the easiest way to find out is to test.

C++11 also introduces the thread_local keyword which is likely more readily 
optimizable than function-call-based APIs where supported.
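
The two access patterns mentioned above can be sketched side by side. ThreadCache, cacheViaPthread and cacheViaThreadLocal are hypothetical names; this is not FastMalloc's code, just a minimal illustration of pthread_once plus pthread_getspecific next to the C++11 thread_local keyword.

```cpp
#include <pthread.h>

#include <cassert>

namespace sketch {

// Hypothetical per-thread allocator state.
struct ThreadCache {
    int allocations = 0;
};

static pthread_key_t cacheKey;
static pthread_once_t cacheKeyOnce = PTHREAD_ONCE_INIT;

static void destroyThreadCache(void* cache)
{
    delete static_cast<ThreadCache*>(cache);
}

static void createCacheKey()
{
    int result = pthread_key_create(&cacheKey, destroyThreadCache);
    assert(!result);
    (void)result;
}

// Function-call-based access: one pthread_once call and one
// pthread_getspecific call on every lookup.
ThreadCache& cacheViaPthread()
{
    pthread_once(&cacheKeyOnce, createCacheKey);
    void* cache = pthread_getspecific(cacheKey);
    if (!cache) {
        cache = new ThreadCache;
        pthread_setspecific(cacheKey, cache);
    }
    return *static_cast<ThreadCache*>(cache);
}

// Keyword-based access: the compiler and runtime decide how to reach the
// per-thread storage, which leaves more room for optimization.
ThreadCache& cacheViaThreadLocal()
{
    thread_local ThreadCache cache;
    return cache;
}

} // namespace sketch
```

Both functions return storage private to the calling thread; which one is faster on a given platform is something to measure rather than assume.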

Regards,
Maciej



Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Maciej Stachowiak

On Oct 1, 2013, at 3:11 PM, Geoffrey Garen gga...@apple.com wrote:

 (4) Find a fast API for aligned virtual memory allocation.
 (5) Find a fast API for committing / decommitting physical memory without 
 releasing virtual memory pages.
 
 Hrm. Isn't this already available via OSAllocator or are you referring
 to the fact that the Posix implementation has a few problems?
 
 OSAllocator is the right model, yes. 
 
 There are two problems with the current OSAllocator POSIX implementation:
 
 (1) It uses mmap, which doesn’t support aligned allocation. To get aligned 
 allocation, POSIX double-allocates all virtual memory. That is 2X too much. 
 So, we need to find a variation on mmap that supports an alignment constraint.

PageAllocationAligned.cpp does this, but it would be more effective to unmap 
the unneeded extra at each end (or use mremap on systems that have it). That 
would be extra VM calls but would not require 2x the space. I'm not sure why it 
doesn't do that already - perhaps because it is building on top of OSAllocator 
and no one tried hard enough to optimize it.

(The current FastMalloc doesn't try to align its requests for system memory to 
more than a page boundary so it doesn't have this issue.)
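
The "unmap the unneeded extra at each end" idea can be sketched like this. mapAligned is a hypothetical helper, not PageAllocationAligned's actual code; it assumes that alignment is a power of two and that both size and alignment are multiples of the page size.

```cpp
#include <sys/mman.h>

#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {

// Maps `size` bytes aligned to `alignment` by over-reserving and then
// returning the unused head and tail to the system.
// Assumptions: alignment is a power of two; size and alignment are
// multiples of the page size.
void* mapAligned(std::size_t size, std::size_t alignment)
{
    const std::size_t reserved = size + alignment;
    void* raw = mmap(nullptr, reserved, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return nullptr;

    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t aligned = (start + mask) & ~mask;

    const std::size_t head = aligned - start;        // Bytes before the aligned block.
    const std::size_t tail = reserved - head - size; // Bytes after it.

    if (head)
        munmap(raw, head);
    if (tail)
        munmap(reinterpret_cast<void*>(aligned + size), tail);

    return reinterpret_cast<void*>(aligned);
}

} // namespace sketch
```

The price is up to two extra munmap calls per allocation; in exchange, the address space held after the call is exactly `size` bytes rather than roughly double.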

 
 (2) POSIX uses MADV_FREE, MADV_DONTNEED, and/or MADV_WILLNEED. I don’t think 
 anybody has ever verified that these APIs do what we want. In my experience, 
 they usually don’t. So, we need to find a variation on these APIs that works 
 and is fast.

I don't have the expertise to know what these do or whether it is what we want. 
But our current malloc uses these, so it would not be a regression for the new 
malloc to use them even if they are subtly wrong, unless there is something 
wildly different about its use of system memory.

 
 We need somebody to resolve these issues, otherwise our memory footprint will 
 be unacceptably high, and/or our VM operations will be unacceptably slow.
 
 Geoff


Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Geoffrey Garen
 To access thread-specific data using pthreads, you first need to take a lock 
 and call pthread_key_create(). Since the whole point of thread-specific data 
 is to avoid taking a lock, the API is useless.
 
 The normal way to do it is to use pthread_once to create the key, which does 
 not in general take a lock. (That or use an out-of-band prior initializer, 
 but that wouldn't work for malloc).

Most implementations of pthread_once use a spinlock, or some moral equivalent. 
Fundamentally, there’s no memory-safe way to implement concurrent one-time 
execution of arbitrary side effects without a spinlock. That’s why requiring 
concurrent one-time execution of arbitrary side effects in order to access 
thread-specific memory is broken API.

 C++11 also introduces the thread_local keyword which is likely more readily 
 optimizable than function-call-based APIs where supported.

thread_local might be a reasonable option, if a platform achieves all the other 
requirements for fast malloc. It’s still too slow, but at least it isn’t slow 
by definition, and it doesn’t pollute the rest of the code too badly.

Geoff


Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Geoffrey Garen
 So are you proposing to use the system allocator on Windows?

I’m proposing a two step process:

(1) Use the system allocator on Windows (and GTK).

(2) If a port maintainer cares to optimize a given port, without too much 
disruption to mainline code, they may do so.

FWIW, If I were conducting (2) for Windows, malloc would be pretty far down the 
list of things I started porting.

 The current malloc logic has been the source of a number of mysterious 
 crashes on Windows, so reverting to the system allocator might be a good 
 thing for stability. I don’t know what the potential performance 
 ramifications would be.

Yes, I’ve heard that on other platforms as well.

Geoff



Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Maciej Stachowiak

On Oct 1, 2013, at 3:47 PM, Geoffrey Garen gga...@apple.com wrote:

 To access thread-specific data using pthreads, you first need to take a 
 lock and call pthread_key_create(). Since the whole point of 
 thread-specific data is to avoid taking a lock, the API is useless.
 
 The normal way to do it is to use pthread_once to create the key, which does 
 not in general take a lock. (That or use an out-of-band prior initializer, 
 but that wouldn't work for malloc).
 
 Most implementations of pthread_once use a spinlock, or some moral 
 equivalent. Fundamentally, there’s no memory-safe way to implement concurrent 
 one-time execution of arbitrary side effects without a spinlock.

This implementation from the Linux C library will only ever take a lock in the 
rare case where initialization has not already been performed, as far as I can 
tell:
http://searchcode.com/codesearch/view/18325089

Assuming my reading is correct, it only ever hits the slow path if 
initialization has not been performed yet, and multiple threads attempt to do 
it at once, which happens at most once early in startup.
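
A rough sketch of that fast-path/slow-path shape, written with C++11 atomics, might look like the following. OnceFlag and initializationsAfterTwoCalls are hypothetical names, and this is not the glibc code linked above, only an illustration of the behavior as I read it.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical one-time initializer, loosely modeled on the behavior
// described above. This is not the glibc implementation.
class OnceFlag {
public:
    // Runs `init` at most once. Returns true if this call ran it.
    template<typename Function>
    bool callOnce(Function init)
    {
        // Fast path: a single acquire load, no lock taken.
        if (m_done.load(std::memory_order_acquire))
            return false;

        // Slow path: only reachable until initialization completes.
        std::lock_guard<std::mutex> locker(m_lock);
        if (m_done.load(std::memory_order_relaxed))
            return false;
        init();
        m_done.store(true, std::memory_order_release);
        return true;
    }

private:
    std::atomic<bool> m_done { false };
    std::mutex m_lock;
};

// Usage: the initializer runs once even though callOnce is invoked twice.
inline int initializationsAfterTwoCalls()
{
    OnceFlag flag;
    int runs = 0;
    flag.callOnce([&] { ++runs; });
    flag.callOnce([&] { ++runs; });
    assert(runs == 1);
    return runs;
}
```

Once the flag is set, every later call costs one atomic load and a branch.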

As far as I know, the only significant cost in practice to using pthread_once + 
pthread_getspecific instead of pthread_getspecific_direct is function call 
overhead. That is my recollection from when we switched on Mac.
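
To make the pattern concrete, here is a minimal sketch of pthread_once + pthread_getspecific. ThreadState, stateKey, createStateKey, and currentThreadState are hypothetical names, not FastMalloc's. The sketch allocates with new for brevity, which a real malloc implementation could not do for its own per-thread state.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical per-thread state; names are illustrative, not WebKit's.
struct ThreadState {
    int allocationCount = 0;
};

static pthread_key_t stateKey;
static pthread_once_t stateKeyOnce = PTHREAD_ONCE_INIT;

// Invoked by pthreads when a thread with a non-null value exits.
static void destroyThreadState(void* value)
{
    delete static_cast<ThreadState*>(value);
}

// Runs exactly once per process, however many threads race to call it.
static void createStateKey()
{
    int result = pthread_key_create(&stateKey, destroyThreadState);
    assert(!result);
    (void)result; // Silences the unused-variable warning in release builds.
}

// Returns the calling thread's state, creating it on first use.
static ThreadState& currentThreadState()
{
    pthread_once(&stateKeyOnce, createStateKey);
    void* value = pthread_getspecific(stateKey);
    if (!value) {
        value = new ThreadState;
        pthread_setspecific(stateKey, value);
    }
    return *static_cast<ThreadState*>(value);
}
```

After the first call on a thread, each access costs two function calls: pthread_once and pthread_getspecific.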

 
 That’s why requiring concurrent one-time execution of arbitrary side effects 
 in order to access thread-specific memory is broken API.

It's definitely lame, but we have existence proofs that you can still be a lot 
faster than popular system malloc implementations without solving this problem 
(namely FastMalloc on Linux platforms today, and FastMalloc as initially 
deployed on Mac before we adopted pthread_getspecific). Does the new malloc 
implementation access thread-specific data much more frequently?

 C++11 also introduces the thread_local keyword which is likely more readily 
 optimizable than function-call-based APIs where supported.
 
 thread_local might be a reasonable option, if a platform achieves all the 
 other requirements for fast malloc. It’s still too slow, but at least it 
 isn’t slow by definition, and it doesn’t pollute the rest of the code too 
 badly.

Maybe it would be easier to understand what the issue is by looking at the code. 

From this and your other posts, it sounds like there might be an issue of code 
pollution/complexity and not just prospective performance.

Regards,
Maciej



Re: [webkit-dev] WTF::fastMalloc

2013-10-01 Thread Zoltan Horvath
On Tue, Oct 1, 2013 at 3:52 PM, Geoffrey Garen gga...@apple.com wrote:

  So are you proposing to use the system allocator on Windows?

 I’m proposing a two step process:

 (1) Use the system allocator on Windows (and GTK).


 (2) If a port maintainer cares to optimize a given port, without too much
 disruption to mainline code, they may do so.

 FWIW, if I were conducting (2) for Windows, malloc would be pretty far
 down the list of things I'd start porting.

  The current malloc logic has been the source of a number of mysterious
 crashes on Windows, so reverting to the system allocator might be a good
 thing for stability. I don’t know what the potential performance
 ramifications would be.

 Yes, I’ve heard that on other platforms as well.


This usually happens because of allocation/free mismatches, for example when
memory allocated by TCMalloc through the FastMalloc interface (fastMalloc,
fastNewMalloc) is freed with the system free.
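
A small sketch of why such a mismatch is fatal: the stand-in allocator below simply wraps the system malloc with a header, which is an assumption made for brevity (TCMalloc manages its own heap instead). The names in the sketch namespace are hypothetical. In both cases the system free is handed an address that the system allocator never returned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for a custom allocator; the names and layout are
// assumptions for illustration, not WebKit's FastMalloc code.
namespace sketch {

// Bookkeeping stored in front of each block; the union keeps the user
// pointer aligned for any type.
union Header {
    std::size_t size;
    std::max_align_t alignment;
};

// Allocates `size` bytes and records the size in a hidden header.
inline void* customMalloc(std::size_t size)
{
    auto* header = static_cast<Header*>(std::malloc(sizeof(Header) + size));
    if (!header)
        return nullptr;
    header->size = size;
    return header + 1; // The caller sees an address past the header.
}

// Returns the size recorded for a block obtained from customMalloc.
inline std::size_t customSize(const void* pointer)
{
    assert(pointer);
    return (static_cast<const Header*>(pointer) - 1)->size;
}

// The only correct way to release a customMalloc block. Calling std::free
// on the user pointer would pass an interior address to the system heap.
inline void customFree(void* pointer)
{
    if (!pointer)
        return;
    std::free(static_cast<Header*>(pointer) - 1);
}

} // namespace sketch
```

Every customMalloc must be paired with customFree. A stray free() on such a pointer corrupts the system heap and usually crashes much later, far from the actual bug.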

I support (1).

I think for (2), it would be better if port maintainers simply tried to
support the core (new) allocator where possible. It would be better to have
only one allocator, plus the system allocator as an option for special
cases.

Cheers,
Zoltan


Re: [webkit-dev] WTF::fastMalloc

2013-09-30 Thread Zoltan Horvath
Hi Geoffrey,

I used to work on memory-related topics while I was at the University of
Szeged.

Based on a 2.5-year-old measurement (
http://webkit.sed.hu/blog/20100302/war-allocators-qtlaunchers-coast) on the
Qt port, page loading on the Methanol test suite was 5% faster on average
with TCMalloc than with the default system allocator on Linux. The SunSpider
results were similar for both allocators. Memory consumption was always
lower with the default OS allocator.

I guess the new allocator only has iOS support. I'm fine with removing
TCMalloc, although this direction might raise further questions, such as
whether to remove the custom allocation framework as well. Feel free to cc
me on bugs; I can help by contributing some patches.

Cheers,
Zoltan


On Mon, Sep 30, 2013 at 2:48 PM, Geoffrey Garen gga...@apple.com wrote:

 Hi folks.

 I’m planning to remove our years-out-of-date port of TCMalloc, and replace
 it with something that takes maximum advantage of Mac and iOS virtual
 memory, threading, and security APIs.

 I've heard that TCMalloc has caused some problems for non-Mac, non-iOS
 ports in the past. So, if you maintain a port, this change might make
 things simpler for you.

 Are there any ports whose built-in malloc implementations are slow enough
 that they can’t get by without TCMalloc?

 Thanks,
 Geoff