On Thu, Jan 6, 2011 at 11:38 AM, Brion Vibber <[email protected]> wrote:
> On Thu, Jan 6, 2011 at 11:01 AM, Jay Ashworth <[email protected]> wrote:
>> > From: "George Herbert" <[email protected]>
>> > I suspect that diffs are relatively rare events in the day-to-day WMF
>> > processing, though non-trivial.
>>
>> Every single time you make an edit, unless I badly misunderstand the
>> current
>> architecture; that's how it's possible for multiple people editing the
>> same article not to collide unless their edits actually collide at the
>> paragraph level.
>>
>> Not to mention pulling old versions.
>>
>> Can someone who knows the current code better than me confirm or deny?
>>
>
> There are a few separate issues mixed up here, I think.
>
>
> First: diffs for viewing and the external diff3 merging for resolving edit
> conflicts are actually unrelated code paths and use separate diff engines.
> (Nor does diff3 get used at all unless there actually is a conflict to
> resolve -- if nobody else edited since your change, it's not called.)
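The conflict-resolution path above can be sketched in miniature. This toy merge assumes the three revisions are line-aligned (no insertions or deletions), which a real diff3 handles and this does not; the names and structure are illustrative only, not MediaWiki's actual code, which shells out to an external diff3 binary.

```python
def merge3(base, mine, theirs):
    """Toy three-way merge of two edits of `base`.

    Returns (merged_lines, conflict_flag). Assumes the revision lists
    are line-aligned; a real diff3 aligns them first.
    """
    merged, conflict = [], False
    for b, m, t in zip(base, mine, theirs):
        if m == b:       # I didn't touch this line: take theirs
            merged.append(t)
        elif t == b:     # they didn't touch it: take mine
            merged.append(m)
        elif m == t:     # both made the same change
            merged.append(m)
        else:            # both changed the same line differently: conflict
            merged.append("<<<<<<< mine\n%s=======\n%s>>>>>>> theirs\n" % (m, t))
            conflict = True
    return merged, conflict
```

The point being illustrated: the merge step only has to do real work when two edits touch the same region, which matches the observation that diff3 isn't even invoked unless a conflict exists.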
>
>
> Second: the notion that diffing a structured document must inherently be
> very slow is, I think, not right.
>
> A well-structured document should be pretty diff-friendly actually; our
> diffs are already working on two separate levels (paragraphs as a whole,
> then words within matched paragraphs). In the most common cases, the diffing
> might actually work pretty much the same -- look for nodes that match, then
> move on to nodes that don't; within changed nodes, look for sub-nodes that
> can be highlighted. Comparisons between nodes may be slower than straight
> strings, but the basic algorithms don't need to be hugely different, and the
> implementation can be in heavily-optimized C++ just like our text diffs are
> today.
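The two-level scheme described above (match paragraphs first, then diff words only inside paragraphs that changed) can be sketched with Python's stdlib difflib. This is just to illustrate the structure; MediaWiki's production engine is heavily optimized native code, and the function name here is made up.

```python
import difflib

def two_level_diff(old_text, new_text):
    """Sketch of a two-level diff: paragraphs first, then words.

    Unchanged paragraphs are skipped cheaply at the first level; only
    changed paragraph pairs pay for a word-level comparison.
    """
    old_paras = old_text.split("\n\n")
    new_paras = new_text.split("\n\n")
    ops = []
    sm = difflib.SequenceMatcher(a=old_paras, b=new_paras)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue  # matched paragraphs cost almost nothing
        if tag == "replace":
            # Second level: word diff inside each changed paragraph pair.
            for old_p, new_p in zip(old_paras[i1:i2], new_paras[j1:j2]):
                words = difflib.SequenceMatcher(a=old_p.split(), b=new_p.split())
                ops.append(("changed", words.get_opcodes()))
        else:
            ops.append((tag, old_paras[i1:i2], new_paras[j1:j2]))
    return ops
```

The same skeleton works whether the "paragraphs" are text blocks or nodes in a structured document; only the equality comparison at each level changes.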
>
>
> Third: the most common diff view cases are likely adjacent revisions of
> recent edits, which smells like cache. :) Heck, these could be made once and
> then simply *stored*, never needing to be recalculated again.
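The "made once and then simply stored" idea works because revisions are immutable: a rendered diff keyed by the revision-ID pair never needs invalidation. A minimal sketch, with names that are illustrative and not MediaWiki's API:

```python
# Cache of rendered diffs keyed by (old_rev_id, new_rev_id). Since
# revisions never change once saved, entries never go stale.
diff_cache = {}

def get_diff(old_rev_id, new_rev_id, render):
    """Return the rendered diff, computing it at most once per pair."""
    key = (old_rev_id, new_rev_id)
    if key not in diff_cache:
        diff_cache[key] = render(old_rev_id, new_rev_id)  # expensive, done once
    return diff_cache[key]
```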
>
>
> Fourth: the notion that diffing structured documents would be overwhelming
> for the entire Wikimedia infrastructure... even if we assume such diffs are
> much slower, I think this is not really an issue compared to the huge CPU
> savings that it could bring elsewhere.
>
> The biggest user of CPU has long been parsing and re-parsing of wikitext.
> Every time someone comes along with different view preferences, we have to
> parse again. Every time a template or image changes, we have to parse again.
> Every time there's an edit, we have to parse again. Every time something
> fell out of cache, we have to parse again.
>
> And that parsing is *really expensive* on large, complex pages. Much of the
> history of MediaWiki's parser development has been in figuring out how to
> avoid parsing quite as much, or setting limits to keep the worst corner
> cases from bringing down the server farm.
>
> We parse *way*, *wayyyyy* more than we diff.
>[...]

Even if we diff on average 2-3x per edit, we're only doing on the order
of ten edits a second across the projects, right?  I'm not going to dig
up the current stats, but that's what I remember from the last time I
looked.
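Spelling out that back-of-envelope (both inputs are rough recollections, not measured stats):

```python
# Upper end of the "2-3x per edit" guess times "order ten edits a second".
diffs_per_edit = 3
edits_per_sec = 10
print(diffs_per_edit * edits_per_sec)  # -> 30 diffs/sec at worst
```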

So: the priority remains parser and actually-used-syntax cleanup, from a
sanity point of view (being able to describe the syntax usefully, and
in a way that allows multiple parsers to be written), with diff
management as a distant, low-impact priority...


-- 
-george william herbert
[email protected]

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
