On Thu, May 5, 2011 at 3:21 AM, Andreas Jonsson
<[email protected]> wrote:
> 2011-05-04 08:13, Tim Starling wrote:
>> On 04/05/11 15:52, Andreas Jonsson wrote:
>>> The time it takes to execute the code that glues together the regexps
>>> will be insignificant compared to actually executing the regexps for any
>>> article larger than a few hundred bytes.  This is at least the case for
>>> the articles that are the easiest for the core parser, which are
>>> articles that contain no markup.  The more markup, the slower it will run.  It is
>>> possible that this slowdown will be lessened if compiled with HipHop.
>>> But the top speed of the parser (in bytes/seconds) will be largely
>>> unaffected.
>>
>> PHP execution dominates for real test cases, and HipHop provides a
>> massive speedup. See the previous HipHop thread.
>>
>> http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html
>>
>> Unfortunately, users refuse to write articles consisting only of
>> hundreds of kilobytes of plain text, they keep adding references and
>> links and things. So we don't really care about the parser's "top speed".
>
> We are talking about different things.  I don't consider callbacks made
> when processing "magic words" or "parser functions" being part of the
> actual parsing.  The reference case of no markup input is interesting to
> me as it marks the maximum throughput of the MediaWiki parser, and is
> what you would compare alternative implementations to.  But, obviously,
> if the Barack Obama article takes 22 seconds to render, there are more
> severe problems than parser performance at the moment.

It's a little more complicated than that, and obviously you haven't
spent much time looking at profiling output from parsing the Barack
Obama article if you say that. What, if not the parser, is slowing
down the processing of that article?

Consider the following:

1. Many things that you would exclude from "parsing" like reference
tags and what-not call the parser themselves.
2. Regardless of whether you include the actual callbacks in your
measurements of parser run time, you need to consider them.
Identifying structures that require callbacks, as well as structures
that don't (such as links, templates, and images), takes time. While
you might reasonably exclude ifexist calls and the like from parser
run time, you most certainly cannot reasonably exclude template
calls, link processing, or the extra time taken by the preprocessor
to identify such structures.

As Domas says, real world data is king. As far as I know, in the case
of 'a a a a', even if you repeat it for a few MB, virtually no PHP
code is run, because the preprocessor uses strcspn to identify
structures requiring preprocessing. That's implemented in C; in fact,
for 'a a a' repeated for a few MB, it's my (probably totally wrong)
understanding that the PHP code runs in more or less constant time.
It's the structures that appear in real articles that make the parser
slow.

—Andrew

--
Andrew Garrett
Wikimedia Foundation
[email protected]

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l