2011-05-06 03:27, Andrew Garrett wrote:
> On Thu, May 5, 2011 at 3:21 AM, Andreas Jonsson
> <[email protected]> wrote:
>> 2011-05-04 08:13, Tim Starling wrote:
>>> On 04/05/11 15:52, Andreas Jonsson wrote:
>>>> The time it takes to execute the code that glues together the regexps
>>>> will be insignificant compared to actually executing the regexps for any
>>>> article larger than a few hundred bytes.  This is at least the case for
>>>> the articles that are the easiest for the core parser, which are articles
>>>> that contain no markup.  The more markup, the slower it will run.  It is
>>>> possible that this slowdown will be lessened if compiled with HipHop.
>>>> But the top speed of the parser (in bytes/seconds) will be largely
>>>> unaffected.
>>>
>>> PHP execution dominates for real test cases, and HipHop provides a
>>> massive speedup. See the previous HipHop thread.
>>>
>>> http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html
>>>
>>> Unfortunately, users refuse to write articles consisting only of
>>> hundreds of kilobytes of plain text, they keep adding references and
>>> links and things. So we don't really care about the parser's "top speed".
>>
>> We are talking about different things.  I don't consider callbacks made
>> when processing "magic words" or "parser functions" being part of the
>> actual parsing.  The reference case of no markup input is interesting to
>> me as it marks the maximum throughput of the MediaWiki parser, and is
>> what you would compare alternative implementations to.  But, obviously,
>> if the Barack Obama article takes 22 seconds to render, there are more
>> severe problems than parser performance at the moment.
> 
> It's a little more complicated than that, and obviously you haven't
> spent a lot of time looking at profiling output from parsing the
> Barack Obama article if you say that — what, if not the parser, is
> slowing down the processing of that article?
> 
> Consider the following:
> 
> 1. Many things that you would exclude from "parsing" like reference
> tags and what-not call the parser themselves.
> 2. Regardless of whether you include the actual callback in your
> measurements of parser run time, you need to consider them.
> Identifying structures that require callbacks, as well as structures
> that don't (such as links, templates, images, and what not) takes
> time. While you might reasonably exclude ifexist calls and so on from
> parser run time, you most certainly cannot reasonably exclude template
> calls, link processing, nor the extra time taken by the preprocessor
> to identify such structures.
> 
> As Domas says, real world data is king. As far as I know, in the case
> of 'a a a a', even if you repeat it for a few MB, virtually no PHP
> code is run, because the preprocessor uses strcspn to identify
> structures requiring preprocessing. That's implemented in C — in fact,
> for 'a a a' repeated for a few MB, it's my (probably totally wrong)
> understanding that the PHP code runs in more or less constant time.
> It's the structures that appear in real articles that make the parser
> slow.

I'm sorry, I misunderstood the original statement that HipHop would
make _parsing_ significantly faster and questioned it on false
premises, because I think of the parser and the preprocessor as
distinctly different components.

Let me explain: as I see it, the first step in formalizing the wikitext
syntax is to analyze it and write a parser that can be used as a drop-in
replacement after preprocessing.  The constructions that are preprocessed
cannot be integrated with the parser without sacrificing compatibility.
Preprocessing is problematic: it breaks the one-to-one relationship
between the wikitext and the syntax tree (i.e., it is impossible to
serialize a syntax tree back to the same wikitext that generated it).
Therefore, in a second step, it should be analyzed how the preprocessed
constructions can be integrated with the parser and how to minimize the
damage from this change.

I had not analyzed the parts of the core parser that I consider
"preprocessing", and it came as a surprise to me that they are as slow
as the Barack Obama benchmark shows.  But integrating template
expansion with the parser would solve this performance problem, which
is in itself a strong argument for working towards replacing it.
I will write about this on wikitext-l.

Best Regards,

Andreas Jonsson

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
