Thanks for the information.

I understand that moving from HTML 4 to HTML 5 is probably a good idea.

However, I am concerned about this statement: "This will require editors to
fix pages and templates to address wikitext patterns that behave differently
with RemexHTML".

As you probably know, the supply of content contributors' time is far too
low to meet the demands of keeping up with everything that ideally would be
done on the content projects.

I am thinking that instead of asking content contributors to spend lots of
hours (do we know how many? Hundreds? Thousands?) fixing all of these issues,
it would make more sense to develop bots to address them.

Here are a few questions:
1. How many fixes do you think will be needed, for the highest priority
fixes as well as all fixes?

2. How many hours of volunteer time do you think that these fixes will
require, for the highest priority fixes as well as all fixes?

3. How feasible would it be to build bots to make 90% of high priority
fixes and 90% of all fixes?
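
To make question 3 a bit more concrete, here is a rough sketch of the kind
of check a bot could run for one pattern mentioned in the quoted post, an
unclosed <small> tag. The function name and the regex are my own assumptions
for illustration. This is not how the Linter extension or Parsoid detects
the problem, and it ignores <nowiki> sections, HTML comments, and tags that
come from templates.

```python
import re

# Hypothetical helper for illustration only; not part of Linter, Parsoid,
# or any existing bot framework.
# Matches <small>, <small attr="...">, </small>, and self-closing <small/>.
SMALL_TAG_RE = re.compile(r"<(/?)small\b[^>]*?(/?)>", re.IGNORECASE)


def count_unclosed_small(wikitext: str) -> int:
    """Return how many <small> tags are opened but never closed."""
    depth = 0
    for match in SMALL_TAG_RE.finditer(wikitext):
        is_closing = bool(match.group(1))
        is_self_closing = bool(match.group(2))
        if is_self_closing:
            continue  # <small/> neither opens nor closes a span
        if is_closing:
            depth = max(depth - 1, 0)  # ignore stray closing tags
        else:
            depth += 1
    return depth
```

A bot built on a check like this would still need human review before
saving edits, since deciding where the missing </small> belongs depends on
what the editor intended.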

I'm not trying to obstruct technical progress, but I am generally not a fan
of WMF adding to volunteers' workloads. If the number of changes involved is
small and the number of hours to make them is small, that is less of a
concern than if we are talking about thousands of changes and hundreds or
thousands of volunteer hours.

Thanks,

Pine


On Thu, Jul 6, 2017 at 5:02 AM, Subramanya Sastry <[email protected]>
wrote:

> How to read this post?
> ----------------------
> * For those without time to read lengthy technical emails,
>   read the TL;DR section.
> * For those who don't care about all the details but want to
>   help with this project, you can read sections 1 and 2 about Tidy,
>   and then skip to section 7.
> * For those who like all their details, read the post in its entirety,
>   and follow the links.
>
> Please ask follow-up questions on wiki *on the FAQ’s talk page* [0]. If you
> find a bug, please report it *on Phabricator or on the page mentioned
> above*.
>
> TL;DR
> -----
> The Parsing team wants to replace Tidy with a RemexHTML-based solution on
> the Wikimedia cluster by June 2018. This will require editors to fix pages
> and templates to address wikitext patterns that behave differently with
> RemexHTML. Please see the 'What editors will need to do' section on the
> Tidy replacement FAQ [1].
>
> 1. What is Tidy?
> ----------------
> Tidy [2] is a library currently used by MediaWiki to fix some HTML errors
> found in wiki pages.
>
> Badly formed markup is common on wiki pages when editors use HTML tags in
> templates and on the page itself. (Ex: unclosed HTML tags, such as a
> <small> without a </small>, are common). In some cases, MediaWiki can
> generate erroneous HTML by itself. If we didn't fix these before sending it
> to browsers, some would display things in a broken way to readers.
>
> But Tidy also does other "cleanup" on its own that is not required for
> correctness. Ex: it removes empty elements and adds whitespace between HTML
> tags, which can sometimes change rendering.
>
> 2. Why replace it?
> ------------------
> Since Tidy is based on HTML4 semantics and the Web has moved to HTML5, it
> also makes some incorrect changes to HTML to 'fix' things that used to not
> work; for example, Tidy will unexpectedly move a bullet list out of a table
> caption even though that's allowed. HTML4 Tidy is no longer maintained or
> packaged. There have also been a number of bug reports filed against Tidy
> [3]. Since Parsoid is based on HTML5 semantics, there are differences in
> rendering between Parsoid's rendering of a page and current read view that
> is based on Tidy.
>
> 3. Project status
> -----------------
> Given all these considerations, the Parsing team started work to replace
> Tidy [4] around mid-2015. Tim Starling started this work and after a survey
> of existing options, decided to write a wrapper over a Java-based HTML5
> parser. At the time we started the project, we thought we could probably
> have Tidy replaced by mid-2016. Alas!
>
> 4. What is replacing Tidy?
> --------------------------
> Tidy will be replaced by a RemexHTML-based solution that uses the
> RemexHTML [5] library along with some Tidy-compatibility shims to ensure
> better parity with the current rendering. RemexHTML is a PHP library that
> Tim wrote with C.Scott’s input that implements the HTML5 parsing spec.
>
> 5. Testing and followup
> -----------------------
> We knew that some pages would be affected and need fixing due to this
> change. In order to more precisely identify what that would be, we wanted
> to do some thorough testing. So, we built some new tools [6][7] and
> overhauled and upgraded other test infrastructure [8][9] to let us evaluate
> the impacts of replacing Tidy (among other such things in the future),
> which could be the subject of a post all on its own.
>
> You can find the details of our testing on the wiki [1][10], but we found
> that a large number of pages had rendering differences. We analyzed the
> results and categorized the source of differences. Based on that, to ease
> the process of replacement, we added a bunch of compatibility shims to
> mimic what Tidy does. I am skipping the details in this post. Even with
> those shims, newer testing showed that a few patterns remain that need
> fixing and that we cannot / don't want to work around automatically.
>
> 6. Tools to assist editors: Linter & ParserMigration
> ----------------------------------------------------
> In October 2016, at the parsing team offsite, Kunal ([[User:Legoktm (WMF)]])
> dusted off the stalled wikitext linting project [11] and (with help from a
> bunch of people in the Parsoid, db/security/code review areas) built the
> Linter extension that surfaces wikitext errors that Parsoid knows about to
> let editors fix them.
>
> Earlier this year, we decided to use Linter in service of Tidy replacement.
> Based on our earlier testing results, we have added a set of high-priority
> linter categories that identifies specific wikitext markup patterns on wiki
> pages that need to be fixed [12].
>
> Separately, Tim built the ParserMigration extension to let editors evaluate
> their fixes to pages [13]. You can enable this in your editing preferences
> or replace '&action=edit' in your url bar with
> '&action=parsermigration-edit'.
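
For anyone who wants to script that URL change, a minimal sketch follows.
The function name and the example URL are assumptions for illustration; only
the two action values come from the post above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_parsermigration_url(edit_url: str) -> str:
    """Swap action=edit for action=parsermigration-edit in a wiki URL.

    Hypothetical helper; other query parameters are kept in order, though
    their percent-encoding may be normalized by urlencode.
    """
    parts = urlsplit(edit_url)
    query = [
        (key, "parsermigration-edit" if (key, value) == ("action", "edit") else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Example URL is illustrative, not taken from the post.
print(to_parsermigration_url(
    "https://en.wikipedia.org/w/index.php?title=Sandbox&action=edit"
))
```

URLs whose action is something other than edit are returned with the action
unchanged.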
>
> 7. What editors have to do
> --------------------------
> The part that you have all been waiting for!
>
> Please see the 'What editors will need to do' section on the Tidy
> replacement FAQ [1]. We have added simplified instructions, so that even
> community members who do not consider themselves "techies" can still learn
> about ways to fix pages. We'll keep that section up to date based on
> feedback and questions. But since it is a wiki, please also edit and tweak
> as required to make the text useful for yourselves! This is a first call
> for fixes and it is about the problems defined as "high priority". We'll
> issue other calls in the future for any other necessary Tidy fixups.
>
> Caveats:
>
> * As noted on that page, the linter categories don't cover all the possible
>   sources of rendering differences. For example, there is still T157418 [14]
>   left to address. For those who have an opinion about this, please chime in
>   on that task. We are still evaluating the best solution for this without
>   adding more cruft to wikitext behavior or kicking the cleanup can down
>   the road.
>
> * As the issues in the identified linter categories are fixed, we might be
>   better able to isolate other issues that need addressing.
>
> 8. So, when will Tidy actually be replaced?
> -------------------------------------------
> We really would like to get Tidy removed from the cluster by June 2018 at
> the latest (or sooner if possible), and your assistance and prompt
> attention to these markup issues would be very helpful. We will do this in
> a phased manner on different wikis rather than all at once on all wikis.
>
> We really want to do this as smoothly as possible without disrupting the
> work of editors or affecting the rendering of the large corpus of pages on
> the various wikis. As you might have gathered from the text above, we have
> built and leveraged a wide variety of tools to assist with this.
>
> 9. Monitoring progress
> ----------------------
> In order to monitor progress, we plan to do a weekly (or some such periodic
> frequency) test run that compares the rendering of pages with Tidy and with
> RemexHTML on a large sample of pages (in the 50K range) from a large subset
> of Wikimedia wikis (~50 or so). This will give us a pulse of how fixups are
> going, and when we might be able to flip the switch on different wikis.
>
> Subramanya (Subbu) Sastry
> Parsing Team.
>
> References
> ----------
> 0. https://www.mediawiki.org/wiki/Talk:Parsing/Replacing_Tidy/FAQ
> 1. https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/FAQ#What_will_editors_need_to_do.3F
> 2. https://en.wikipedia.org/wiki/HTML_Tidy
> 3. https://phabricator.wikimedia.org/tag/tidy/
> 4. https://phabricator.wikimedia.org/T89331
> 5. https://github.com/wikimedia/mediawiki-libs-RemexHtml
> 6. https://phabricator.wikimedia.org/T120345
> 7. https://github.com/wikimedia/integration-uprightdiff
> 8. https://github.com/wikimedia/integration-visualdiff
> 9. https://github.com/wikimedia/mediawiki-services-parsoid-testreduce
> 10. https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy
> 11. https://phabricator.wikimedia.org/T48705
> 12. https://www.mediawiki.org/wiki/Help:Extension:Linter#Goal:_Replacing_Tidy
> 13. https://www.mediawiki.org/wiki/Help:Extension:Linter#Verifying_fixes_for_these_lint_categories
> 14. https://phabricator.wikimedia.org/T157418
>
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l