On Thu, Jan 6, 2011 at 11:38 AM, Brion Vibber <[email protected]> wrote:
> On Thu, Jan 6, 2011 at 11:01 AM, Jay Ashworth <[email protected]> wrote:
>> > From: "George Herbert" <[email protected]>
>> > I suspect that diffs are relatively rare events in the day-to-day WMF
>> > processing, though non-trivial.
>>
>> Every single time you make an edit, unless I badly misunderstand the
>> current architecture; that's how it's possible for multiple people
>> editing the same article not to collide unless their edits actually
>> collide at the paragraph level.
>>
>> Not to mention pulling old versions.
>>
>> Can someone who knows the current code better than me confirm or deny?
>
> There are a few separate issues mixed up here, I think.
>
> First: diffs for viewing and the external diff3 merging for resolving
> edit conflicts are actually unrelated code paths and use separate diff
> engines. (Nor does diff3 get used at all unless there actually is a
> conflict to resolve -- if nobody else edited since your change, it's
> not called.)
>
> Second: the notion that diffing a structured document must inherently
> be very slow is, I think, not right.
>
> A well-structured document should actually be pretty diff-friendly; our
> diffs already work on two separate levels (paragraphs as a whole, then
> words within matched paragraphs). In the most common cases, the diffing
> might work pretty much the same way -- look for nodes that match, then
> move on to nodes that don't; within changed nodes, look for sub-nodes
> that can be highlighted. Comparisons between nodes may be slower than
> comparisons between plain strings, but the basic algorithms don't need
> to be hugely different, and the implementation can be in
> heavily-optimized C++ just like our text diffs are today.
>
> Third: the most common diff view cases are likely adjacent revisions of
> recent edits, which smells like cache.
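The two-level scheme described above -- match whole paragraphs first,
then re-diff words only inside changed paragraphs -- can be sketched
with Python's stdlib difflib. This is illustrative only, not
MediaWiki's actual C++ diff engine; the function name and output shape
are assumptions:

```python
# Two-level diff sketch: paragraphs first, then words within changed
# paragraph runs. Illustrative only -- not MediaWiki's diff engine.
from difflib import SequenceMatcher


def two_level_diff(old_text, new_text):
    """Yield (op, old_paras, new_paras, word_ops) tuples.

    word_ops is None except for 'replace' ops, where the changed
    paragraph run is re-diffed at word granularity.
    """
    old_paras = old_text.split("\n\n")
    new_paras = new_text.split("\n\n")
    for op, i1, i2, j1, j2 in SequenceMatcher(
            None, old_paras, new_paras).get_opcodes():
        if op == "replace":
            # Paragraphs differ: descend to word level inside the pair.
            old_words = " ".join(old_paras[i1:i2]).split()
            new_words = " ".join(new_paras[j1:j2]).split()
            word_ops = SequenceMatcher(
                None, old_words, new_words).get_opcodes()
            yield (op, old_paras[i1:i2], new_paras[j1:j2], word_ops)
        else:
            yield (op, old_paras[i1:i2], new_paras[j1:j2], None)
```

The same shape extends to a structured document by swapping "split on
blank lines" for "walk top-level nodes" and "split on spaces" for "walk
child nodes", which is the point being made: the outer algorithm stays
the same, only the comparison units change.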
> :) Heck, these could be made once and then simply *stored*, never
> needing to be recalculated again.
>
> Fourth: the notion that diffing structured documents would be
> overwhelming for the entire Wikimedia infrastructure... even if we
> assume such diffs are much slower, I think this is not really an issue
> compared to the huge CPU savings it could bring elsewhere.
>
> The biggest user of CPU has long been parsing and re-parsing of
> wikitext. Every time someone comes along with different view
> preferences, we have to parse again. Every time a template or image
> changes, we have to parse again. Every time there's an edit, we have
> to parse again. Every time something falls out of cache, we have to
> parse again.
>
> And that parsing is *really expensive* on large, complex pages. Much
> of the history of MediaWiki's parser development has been about
> figuring out how to avoid parsing quite as much, or about setting
> limits to keep the worst corner cases from bringing down the server
> farm.
>
> We parse *way*, *wayyyyy* more than we diff.
[...]
Even if we diff on average 2-3x per edit, we're only doing on the order
of ten edits a second across the projects, right? Not going to dig up
the current stats, but that's what I remember from the last time I
looked.

So: from a sanity point of view, the priority remains cleanup of the
parser and of the actually-used syntax (being able to describe the
syntax usefully, and in a way that allows multiple parsers to be
written), with diff management as a distant, low-impact priority...

--
-george william herbert
[email protected]

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
