Some years back I was importing a large number of complex templates to a
wiki that didn't have Tidy enabled.  The results were nothing short of
horrendous in a substantial number of cases.  Wiki authors will generally
stop worrying about their code as long as the results look right.  For good
or ill, Tidy does a remarkable job of localizing unclosed tags, and often
that is enough to fix the appearance of broken HTML so that it doesn't
spill over into other sections.  Without Tidy (or its equivalent) there
will be a lot of template garbage that needs to be repaired.
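
To make "localizing unclosed tags" concrete, here is a minimal sketch in
Python.  It is not how Tidy works internally; it only illustrates the effect
described above.  The names FragmentBalancer and balance_fragment are
hypothetical.  The sketch assumes each block of user content is cleaned up on
its own, so that anything left open is closed before the surrounding page
markup resumes.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are never tracked as open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class FragmentBalancer(HTMLParser):
    """Hypothetical helper: re-emits a fragment with balanced tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of the cleaned-up fragment
        self.stack = []  # names of elements that are currently open

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> is copied through untouched.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag with no matching opener: drop it
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def balance_fragment(html):
    """Return html with unclosed tags closed at the end of the fragment."""
    parser = FragmentBalancer()
    parser.feed(html)
    parser.close()
    # Anything still open would otherwise leak into the rest of the page.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)
```

For example, balance_fragment("<div><b>bold text") closes both elements at
the end of the fragment, so the bold formatting and the div cannot swallow
whatever follows.  A real replacement would need far more than this (table
fix-ups, block/inline nesting rules, and so on), which is why the quality of
the substitute matters.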

The garbage in -> garbage out approach might seem appealing in principle,
but any transition to it is going to dredge up a lot of malformed HTML
created by wiki editors that we've been hiding for many years.  If one is
going to replace Tidy with something substantially different in execution,
I would suggest that one needs a significant test suite of complex pages in
order to judge how bad the collateral damage is likely to be, and ideally
some set of tools to help editors fix it.

-Robert Rohde

On Thu, Aug 13, 2015 at 7:51 AM, Brian Wolff <[email protected]> wrote:

> On 8/12/15, MZMcBride <[email protected]> wrote:
> > Tim Starling wrote:
> >>https://phabricator.wikimedia.org/T89331
> >>
> >>Running the output of the MediaWiki parser through HTML Tidy always
> >>seemed like a nasty hack. The effects on wikitext syntax are arbitrary
> >>and change from version to version. When we upgrade our Linux
> >>distribution, we sometimes see changes in the HTML generated by given
> >>wikitext, which is not ideal.
> >>
> >>[...]
> >>
> >>We can get nearly the same effect in MediaWiki by replacing the Tidy
> >>transformation stage with an HTML 5 parse followed by serialization of
> >>the DOM back to HTML. This would stabilize wikitext syntax and resolve
> >>several important syntax differences compared to Parsoid.
> >
> > Related tasks:
> >
> > * https://phabricator.wikimedia.org/T4542
> > * https://phabricator.wikimedia.org/T56617
> >
> > It's not clear to me which behaviors from Tidy we want to keep. Looking at
> > the various bugs that Tidy has caused, it's apparent that there are a
> > number of behaviors we want to disable/avoid.
> >
> > My understanding is that Tidy is not responsible for output sanitization
> > and it's not responsible for preprocessing or parsing. MediaWiki handles
> > all of that elsewhere. If Tidy is only needed for mismatched HTML
> > elements, we could possibly catch and disallow or gracefully handle that
> > specific use-case in MediaWiki. What other beneficial behavior of Tidy
> > would we need to replicate?
> >
> > Or could we replace Tidy with nothing? Relying on the principle of
> > "garbage in, garbage out" seems reasonable in some ways. And modern
> > browsers are fairly adept at handling moderately bad HTML.
> >
> > MZMcBride
> >
> >
>
> The main thing Tidy does (imo) is ensure that failures from mismatched
> HTML are localized. When somebody makes a mistake, it can cause the entire
> skin to go whacko. We ideally want markup mistakes to affect only the
> user-generated content (and preferably only the area around the
> mistake).
>
> --bawolff
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>