Hello Shane,
First, great job looking into the bowels of Trac ;-)
Then, as a general comment, I see that some of your suggestions actually
go against some of the changes I made in #6614, so, not surprisingly, we
have a trade-off of memory vs. speed. In some environments where memory
is strictly limited, we have no choice but to optimize for memory, to
the detriment of speed. But in most environments, the extra memory
needed to achieve decent speed might be affordable. So I see here the
opportunity for a configuration setting, something like [trac]
favor_speed_over_memory, defaulting to true, which people with limited
resources could turn off.
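Roughly speaking, that could be declared like any other option; a
minimal sketch, with the option name being only a working title:

{{{
from trac.config import BoolOption
from trac.core import Component

class SomeComponent(Component):
    # hypothetical option, name and default still to be discussed
    favor_speed_over_memory = BoolOption('trac', 'favor_speed_over_memory',
        'true', """Trade some extra memory usage for faster request
        processing. Turn this off on memory-constrained installations.""")
}}}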
Now let me comment on some of your observations.
Shane Caraveo wrote:
> I was going to do this as a ticket since it's so long, but Edgewall likes
> to think of me as a spammer sometimes.
[OT] Even with your Preferences/General/Full name and Email address set?
> This was partly written as a
> response/suggestion to people posting on ticket 7490 as well, so
> includes suggestions.
>
> Shane
>
> ----
>
> I've recently reduced per-request time for *simple* template generation
> from 150ms to 45ms on my laptop (dual-core 1.6GHz Core Duo, 2GB RAM,
> tested with ab against a simple template). I also reduced page load times
> (reported via Google Page Load/Firebug) to around a second (from 2-8
> seconds in some cases). These timings are without any network overhead
> since I'm testing on the local machine. Times are all taken on my laptop
> with my typical environment/applications open (lots of crap).
>
> This is all done with 0.11.5.
>
> Even with the gains I still get 100% CPU spikes on every request, but at
> least they're of shorter duration. Still digging further on that. I've
> also done a bit on overall browser page load.
>
That spike is probably occurring during Genshi's generate and render calls.
> Here's my current brain dump. If some of my conclusions are silly,
> please let me know why.
>
> == General ==
>
> In general there are only a couple of big wins. For me they were removing
> gc.collect (see trac.main) and the Timing and Estimation plugin.
> Everything else was small potatoes in comparison (10ms here, 5ms there),
> but together they added up to a good 40-50ms per request. Think of it
> this way: at 100% CPU and 50ms/request, you're limited to a max of 20
> requests/second/cpu. Every ms counts if we want decent throughput. I'd
> like to get under 30ms.
>
The gc on every request is the typical memory vs. speed trade-off. If it
can be shown that, despite not doing a gc after every request, the memory
usage stays within bounds, then I think we can make that optional. As you
said elsewhere, it's quite possible that this explicit gc simply hides a
real memory leak that could be avoided by other means (like fixing the db
pool issue with PostgreSQL).
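Something along these lines, called at the end of request dispatching,
could make it optional (only a sketch; maybe_collect and the option are
the hypothetical ones from above):

{{{
import gc

def maybe_collect(env):
    """Run a full garbage collection only when the hypothetical
    favor_speed_over_memory option is turned off."""
    if not env.config.getbool('trac', 'favor_speed_over_memory', True):
        gc.collect()
}}}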
> === How I timed ===
>
> My test setup is Apache prefork with WSGI daemon mode, one process,
> mod_deflate enabled. My laptop is pretty loaded, with very little free
> memory, but generally enough that Trac shouldn't be swapping. Postgres
> is the database, running on the laptop as well.
>
> For timing template generation I used ApacheBench and tossed out
> "warmup" requests, sometimes testing with keep-alive (getting slightly
> better req/s):
>
> {{{
> ab [-k] -c 1 -n 10 url
> }}}
>
> With zero network latency, the timing is essentially how much time Trac
> is taking to generate the content. I tested against a simple template
> generated from a small plugin that just renders a template file (so as to
> involve as little other processing as possible, such as querying the
> ticket table or checking permissions), and against WikiStart.
>
> For "page load" I used Google Page Load extension for firebug and
> firefox 3.5. I also tested with Safari 4 which has similar
> functionality now, it's much faster than firefox :(
>
> I tested with plugins enabled since I wanted to also discover whether
> any plugins were causing me grief; some were.
>
> I tested against empty projects, so table indexing or table size are not
> involved. I also tested against a small-sized project in some
> instances, which turned up issues in a couple of plugins (e.g. T&E).
>
> === Profiling issues ===
>
> It seems to me that cProfile adds time from generators to the function
> that uses them, masking where the real time is spent. That makes
> profiling Genshi terribly difficult. I also tried out dtrace; it gives
> the same rough data as cProfile. Knowing a tiny bit about the Python
> engine (having written a C-level debugger extension), I figure that
> it has something to do with how generators work in the Python core.
>
> I rewrote in C a couple of areas where cProfile claimed a lot of
> exclusive time in Genshi. This didn't change my performance at all, but
> it did shift timing to other areas, which helped me find a couple of the
> actual time sinks in Genshi (see below).
>
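For reference, a single dispatch can be profiled from a small WSGI
wrapper, keeping the generator caveat above in mind (just a sketch, not
part of Trac):

{{{
import cProfile
import pstats

def profiled(application):
    """Wrap a WSGI application and print cProfile stats for each request."""
    def wrapper(environ, start_response):
        profiler = cProfile.Profile()
        result = profiler.runcall(application, environ, start_response)
        pstats.Stats(profiler).sort_stats('cumulative').print_stats(25)
        return result
    return wrapper
}}}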
> == For general performance ==
>
> Use either a multithreaded Apache if you use mod_python, or use WSGI in
> multithreaded daemon mode. It won't remove high-CPU slow requests, but
> it does lower memory use to some extent. While that didn't improve the
> per-request performance, I got higher concurrent throughput with that
> setup. Apache prefork with mod_python, or WSGI in embedded mode, will
> chew up memory, and due to the high CPU usage, processes will be fighting
> each other more for resources.
>
> === customizations and plugins ===
>
> The single largest gain I got with plugins was by removing the Timing and
> Estimation plugin (primarily seen on the ticket reports page, where it
> does another SQL query for each ticket, doubling the request time). Its
> stream filters slow lots of stuff down. Removing other plugins with
> stream filters, or reworking them to not use stream filters when possible,
> is a win. In some instances, the TracHours plugin is a better solution if
> you don't need all the T&E features.
>
> Get rid of stuff like the Google Analytics plugin and just add the code
> to a site or theme template. You'll get better performance. It's not
> that it's a bad plugin, but it does something simple that adds *Genshi*
> overhead.
>
> Examine any custom templates for extraneous py:match use and get rid of
> it. I had one match in the site template that was removing something from
> an admin panel, which I had forgotten about; a big mistake that cost
> performance on *every* page. I patched the template directly
> instead and saved myself 10ms per request. This kind of thing doesn't
> show up in the profiler; it was some debug dumps I inserted deep into
> the Genshi bowels that showed me the light.
>
Yes, py:match directives are real killers. When an extra py:match was
introduced for the theme support, I remember measuring a 10% increase in
the average time needed for any request.
> Make sure all static resources for all plugins are exported. trac-admin
> deploy will only deploy static resources for enabled plugins in that
> project.
>
> === javascript ===
>
> This is mostly about page load times.
>
> Minify all the JavaScript output by trac-admin deploy, and make sure it
> gets cache headers via the Apache conf.
>
> Upgrade to jQuery 1.3; it's much faster. Trac trunk has done this, and
> there is a diff somewhere that shows what kind of changes have to be
> made. You'd have to examine any plugins for similar jQuery updates that
> need to be done.
>
A backport from trunk to 0.11-stable would be welcome ;-)
> If you use Firebug, be sure you have disabled it (or at least disable
> the console tab) for Trac pages; the console somehow interrupts things
> and adds a couple of seconds of delay in the browser.
>
> === css ===
>
> I haven't gotten to this yet, but there is a lot of slow CSS there.
> Removing all the CSS as a test, my page load in the browser was 200ms
> faster. Google Page Load is good at reporting which CSS *might* be slow;
> it reports lots of inefficient CSS in Trac.
>
> == trac issues ==
>
> === gc.collect ===
>
> The next big item was removing gc.collect. If I recall correctly,
> gc.collect took at least 50ms of the request time, sometimes upwards of
> 100ms. It also hides other bugs (ticket:8443 for example: the Postgres
> idle bug came back without gc.collect, and I've got a patch in that
> ticket). gc.collect could be done in a background thread from tracd, but
> I wouldn't suggest background threads for mod_python prefork. With WSGI
> daemon mode, just limit the number of requests before restart to
> something you think is fine and test memory consumption. I think the
> use of gc.collect should be seriously reconsidered.
>
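The background-thread idea for tracd could look roughly like this (again
only a sketch; the interval and the exact placement would need testing):

{{{
import gc
import threading
import time

def start_background_gc(interval=60):
    """Periodically run a full collection instead of paying for it
    on every request (only makes sense for a long-running tracd)."""
    def loop():
        while True:
            time.sleep(interval)
            gc.collect()
    t = threading.Thread(target=loop)
    t.setDaemon(True)
    t.start()
}}}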
> I have a sneaking suspicion (unproven) that people who use mod_python
> and claim that turning off keep-alive and/or mod_deflate helps are
> actually having problems due to gc.collect. As I understand Apache
> filters (e.g. mod_deflate), they won't finish up the request until the
> mod_python handler has returned. So that extra time in gc.collect delays
> the request being returned, which delays mod_deflate finishing. It also
> delays the start of the next request over a persistent socket connection
> (i.e. keep-alive).
>
With regard to mod_deflate, I'm not sure how an extra 100ms can explain
the reported difference in behavior.
IIUC, you're using mod_deflate without any trouble, and switching it off
doesn't make a difference?
Was that also the case for you with 0.11.4?
> === filters ===
>
> Going through all the match_request implementations and removing
> permission checking (putting it in process_request instead), making sure
> regex matches are precompiled, and generally simplifying things helps.
> There are a number of those in Trac core, but plugins are the greater
> abusers in this area. Also examine any IRequestFilter use and simplify.
>
I'm not sure the advice of removing permission checking from
match_request is a good one.
If it is done after the path_info test, the permission check shouldn't
have a big impact, and it might be needed to enable the use of a
fallback handler.
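To illustrate the kind of match_request we're talking about, a minimal
handler could look like this (a sketch with made-up names, not actual
Trac code):

{{{
import re

from trac.core import Component, implements
from trac.web.api import IRequestHandler

class MyPathHandler(Component):
    implements(IRequestHandler)

    _match = re.compile(r'/mypath(?:/|$)')  # precompiled once

    def match_request(self, req):
        # cheap path test only, no permission or db work here
        return self._match.match(req.path_info) is not None

    def process_request(self, req):
        req.perm.require('WIKI_VIEW')  # permission check deferred to here
        return 'mypath.html', {}, None
}}}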
> === trac.config ===
>
> Other than Genshi, profiling showed trac.config to be the single largest
> time sink in simple template generation. Profiling showed me that the get
> method of the Configuration class (trac.config.Configuration) was slow.
> I added caching there for a few extra milliseconds of boost. I'm also
> looking at removing the auto-reload when the ini file changes, maybe
> using Spread or memcached to create reload notifications, to get rid of
> file stats, but timing doesn't show this to be a large issue on my laptop.
>
Interesting, can you make a ticket out of that?
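Just to make the idea concrete, the kind of caching I imagine looks
roughly like this (a sketch only, probably not what Shane actually did;
invalidation when the file is re-parsed is left out):

{{{
from trac.config import Configuration

class CachingConfiguration(Configuration):
    """Memoize option lookups until the configuration is re-parsed."""

    def __init__(self, filename):
        Configuration.__init__(self, filename)
        self._value_cache = {}

    def get(self, section, name, default=''):
        key = (section, name, default)
        try:
            return self._value_cache[key]
        except KeyError:
            value = Configuration.get(self, section, name, default)
            self._value_cache[key] = value
            return value
}}}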
> === repository ===
>
> While I still want to remove the sync on every request (to get rid of
> system calls/file stats), I have been unable to show that performance
> changes much when I test with it removed. There are still bigger fish
> to fry.
>
This will be addressed in the multirepos branch; I think we discussed
making a setting for this, in order to keep a backward-compatible
behavior for the default repository.
> === database pool ===
>
> Watching the Postgres log file, I can tell that a lot of cursors get
> created without actually being used to do a query. This shows up
> because psycopg2 executes a BEGIN when a cursor is created. I haven't
> yet looked into where that is happening, but it's extra work the system
> is doing for nothing.
>
That's probably also worth a ticket on its own, unless this could be
handled in #8443.
> === wiki ===
>
> The current wiki parser design is slow, running the same large regex over
> each line. I think a general redesign to use re.finditer, rather than
> line split then re.match, would help improve performance here. However,
> post-caching the content would be better. I'm experimenting with
> partial caching of the wiki content and have reduced my request timing
> on WikiStart from 120ms to 75ms while still getting correct content.
> The patch I have doesn't cache macros unless the macro arguments have
> 'cache' in them (good for the PageOutline macro, which BTW seems to
> re-parse the entire page, macros included). There may be other issues
> with the approach I have taken; I haven't tested it much. Once I get
> further, I might actually pre-process when the wiki page is saved and
> stash the result into the database, avoiding the parse in most requests.
>
Interesting as well, be sure to read the discussions about caching wiki
content (#7739 and #1216).
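About the re.finditer idea, the difference is essentially between
matching line by line and scanning the whole text in one pass; a toy
illustration (not the actual wiki rules):

{{{
import re

# toy subset of rules, just to illustrate the single-pass scan
_rules = re.compile(r"(?P<bold>''')|(?P<italic>'')|(?P<heading>^=+ .*$)",
                    re.MULTILINE)

def scan(text):
    """Yield (rule name, start, end) for every token found in one pass."""
    for match in _rules.finditer(text):
        yield match.lastgroup, match.start(), match.end()
}}}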
> == Genshi ==
>
> Use Genshi trunk; there's some performance gain there.
>
> With Genshi, one big win is turning off the whitespace filter; I think I
> got around 10ms on a small page, and the gain would be larger on larger
> pages. I did that by patching trac.web.chrome where there is a call to
> stream.render, adding strip_whitespace=False. While this slightly
> increases the delivered file size (and adds a lot of whitespace to the
> source view), that can be countered by mod_deflate. The namespace
> flattener is another slow filter, but it's necessary.
>
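For those wanting to try that, the patch amounts to passing one extra
keyword where the stream gets rendered, roughly (sketched from memory;
the surrounding code in trac.web.chrome differs):

{{{
def render_without_whitespace_filter(stream):
    """Render a Genshi stream, skipping the whitespace stripping step
    (approximately what the patch to trac.web.chrome does)."""
    return stream.render('xhtml', strip_whitespace=False)
}}}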
> Another win with Genshi is patching genshi.output.encode and getting rid
> of the iteration that calls _encode a bazillion times. Instead, just do:
>
> {{{
> out.write(_encode(u''.join(list(iterator))))
> }}}
>
> This is much faster.
>
The above was used before #6614, but it triggers the creation of a
huge list and a huge unicode string. We could also revert to doing that
when `favor_speed_over_memory` is true.
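Schematically, the two strategies look like this (not the actual
genshi.output code, just the shape of the trade-off):

{{{
def encode_chunked(iterator, _encode, out):
    # memory-friendly: encode and write chunk by chunk (many _encode calls)
    for chunk in iterator:
        out.write(_encode(chunk))

def encode_joined(iterator, _encode, out):
    # speed-friendly: build one big unicode string and encode it once
    out.write(_encode(u''.join(list(iterator))))
}}}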
> I should note, I also tried changing trac so that the output went
> straight through to the wsgi layer (rather than buffering it in a
> cStringIO object). While this improved the initial wait time
> dramatically, it actually slowed down the overall request time. I
> believe this slowdown was also due to the _encode calls inside the
> iteration over the stream. I'm still going to experiment with this more
> at some point.
>
Interesting, but this will break the keep-alive support for tracd
(introduced in 0.11.5), as you won't be able to get the Content-Length
this way (one solution would be to add support for chunked transfer
encoding to tracd).
> Overall, I like Genshi; now that I kind of get how it works, it has a
> certain elegance, but I am left unconvinced that its design and
> flexibility outweigh the performance issues and the added difficulty
> in figuring them out (refer to profiling at the top).
Oh yes, same feeling here ;-)
> One way I'm
> thinking of working around this is to beef up the XML-RPC interfaces and
> do everything with Ajax.
>
Maybe. Another idea would be to make more things "opaque" to Genshi, by
pre-rendering chunks that we know are irrelevant to further
transformations. A typical example would be the diffs and the highlighted
lines in the source browser.
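The idea being that anything wrapped in a Markup instance travels
through the templates as a single opaque chunk; a sketch with made-up
names:

{{{
from genshi.core import Markup, escape

def prerendered_rows(lines):
    # hypothetical helper: build the HTML once, outside of Genshi
    html = ''.join('<tr><td>%s</td></tr>' % escape(line) for line in lines)
    return Markup(html)  # emitted as-is, not re-processed by Genshi
}}}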
> == Misc ==
>
> === plugins ===
>
> I don't think most plugin authors know where they will generate
> performance penalties. It's hard to know all those details without
> spending a lot of time on it.
>
> Generally, pay attention to any plugins that use stream filters or
> request filters. These are called for every request and more often than
> not do something that is time-intensive. Additionally, IMO, a request
> handler's match_request should never do anything beyond
> req.path_info.startswith('/mypath'). Many match_request implementations
> do a lot more, which gets executed for every request.
>
Good tips, this should definitely go into the wiki somewhere
(TracDev/Performance?)
> === wysiwyg plugin ===
>
> This adds a large JS file to *every* request, whether useful or not. I
> removed the large script from the request filter, then changed the
> load script to check whether there is a textarea element in the DOM,
> and load the larger script only if necessary.
>
Again, thanks for working on this!
-- Christian