Hi Subbu!

I have only just started using WPCleaner to fix some of the errors reported
by Linter, and I know I still have work to do on WPCleaner to make this
easier for users.
In the meantime, I have a few questions and suggestions regarding Linter:

   - Is it possible to also retrieve the localized names of the Linter
   categories and priorities? For example, on frwiki, you can see on the
   Linter page [1] that the high priority is translated as "Priorité haute"
   and that self-closed-tag has the user-friendly name "Balises auto-fermantes".
   I don't see the localized names in the information returned by the API for
   siteinfo.
   - Where can the description displayed on each category's dedicated page
   be changed? For example, the page for self-closed-tag [2] is
   very short. It would be nice to be able to add a description of what the
   error is, what problems it can cause, and how to fix it
   (or to be able to link to a page explaining all that).
   - On the page dedicated to a category, there's a column saying whether the
   problem comes from one template (and which one) or from several templates,
   but I don't get this information from the REST API for Linter. Is it
   possible to have it in the API result, or should I deduce it myself by
   checking whether the offsets given by the API fall within a call to a
   template?
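
To illustrate the second option, here is a minimal sketch (in Python, for
brevity) of how the template could be deduced from the offsets. It assumes
the API gives a [start, end) pair of character offsets into the page's
wikitext. The function names template_spans and templates_containing are
made up for this example, and the brace matching is deliberately naive: it
ignores <nowiki>, comments and {{{parameters}}}.

```python
from typing import List, Tuple


def template_spans(wikitext: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, name) for each top-level {{...}} call.

    Naive brace matching: nested calls are folded into their outermost
    call; <nowiki>, comments and {{{parameters}}} are NOT handled.
    """
    spans = []
    depth = 0
    start = 0
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i  # remember where the outermost call opens
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
            if depth == 0:
                # Template name is the text before the first "|".
                inner = wikitext[start + 2:i - 2]
                name = inner.split("|", 1)[0].strip()
                spans.append((start, i, name))
        else:
            i += 1
    return spans


def templates_containing(wikitext: str, start: int, end: int) -> List[str]:
    """Names of top-level template calls that fully contain [start, end)."""
    return [
        name
        for (s, e, name) in template_spans(wikitext)
        if s <= start and end <= e
    ]


if __name__ == "__main__":
    text = "Intro {{Infobox|name=<br/>}} and <br/> outside"
    inside = text.index("<br/>")
    outside = text.rindex("<br/>")
    print(templates_containing(text, inside, inside + 5))
    print(templates_containing(text, outside, outside + 5))
```

An empty result would mean the error is in the page's own wikitext, and a
single name would correspond to the "one template" case shown in the column.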


[1] https://fr.wikipedia.org/wiki/Sp%C3%A9cial:LintErrors
[2] https://fr.wikipedia.org/wiki/Sp%C3%A9cial:LintErrors/self-closed-tag



On Thu, Jul 6, 2017 at 2:02 PM, Subramanya Sastry <ssas...@wikimedia.org>
wrote:

> How to read this post?
> ----------------------
> * For those without time to read lengthy technical emails,
>   read the TL;DR section.
> * For those who don't care about all the details but want to
>   help with this project, you can read sections 1 and 2 about Tidy,
>   and then skip to section 7.
> * For those who like all their details, read the post in its entirety,
>   and follow the links.
>
> Please ask follow up questions on wiki *on the FAQ’s talk page* [0]. If you
> find a bug, please report it *on Phabricator or on the page mentioned
> above*.
>
> TL;DR
> -----
> The Parsing team wants to replace Tidy with a RemexHTML-based solution on
> the Wikimedia cluster by June 2018. This will require editors to fix pages
> and templates to address wikitext patterns that behave differently with
> RemexHTML. Please see 'What editors will need to do' section on the Tidy
> replacement FAQ [1].
>
> 1. What is Tidy?
> ----------------
> Tidy [2] is a library currently used by MediaWiki to fix some HTML errors
> found in wiki pages.
>
> Badly formed markup is common on wiki pages when editors use HTML tags in
> templates and on the page itself. (Ex: unclosed HTML tags, such as a
> <small> without a </small>, are common). In some cases, MediaWiki can
> generate erroneous HTML by itself. If we didn't fix these before sending it
> to browsers, some would display things in a broken way to readers.
>
> But Tidy also does other "cleanup" on its own that is not required for
> correctness. Ex: it removes empty elements and adds whitespace between HTML
> tags, which can sometimes change rendering.
>
> 2. Why replace it?
> ------------------
> Since Tidy is based on HTML4 semantics and the Web has moved to HTML5, it
> also makes some incorrect changes to HTML to 'fix' things that used to not
> work; for example, Tidy will unexpectedly move a bullet list out of a table
> caption even though that's allowed. HTML4 Tidy is no longer maintained or
> packaged. There have also been a number of bug reports filed against Tidy
> [3]. Since Parsoid is based on HTML5 semantics, there are differences in
> rendering between Parsoid's rendering of a page and current read view that
> is based on Tidy.
>
> 3. Project status
> -----------------
> Given all these considerations, the Parsing team started work to replace
> Tidy [4] around mid-2015. Tim Starling started this work and after a survey
> of existing options, decided to write a wrapper over a Java-based HTML5
> parser. At the time we started the project, we thought we could probably
> have Tidy replaced by mid-2016. Alas!
>
> 4. What is replacing Tidy?
> --------------------------
> Tidy will be replaced by a RemexHTML-based solution that uses the
> RemexHTML [5] library along with some Tidy-compatibility shims to ensure
> better parity with the current rendering. RemexHTML is a PHP library that
> Tim wrote with C.Scott’s input that implements the HTML5 parsing spec.
>
> 5. Testing and followup
> -----------------------
> We knew that some pages will be affected and need fixing due to this
> change. In order to more precisely identify what that would be, we wanted
> to do some thorough testing. So, we built some new tools [6][7] and
> overhauled and upgraded other test infrastructure [8][9] to let us evaluate
> the impacts of replacing Tidy (among other such things in the future) which
> can be a subject of a post all on its own.
>
> You can find the details of our testing on the wiki [1][10], but we found
> that a large number of pages had rendering differences. We analyzed the
> results and categorized the source of differences. Based on that, to ease
> the process of replacement, we added a bunch of compatibility shims to
> mimic what Tidy does. I am skipping the details in this post. Even after
> that, newer testing showed that this nevertheless still leaves us with a
> few patterns that need fixing that we cannot / don't want to work around
> automatically.
>
> 6. Tools to assist editors: Linter & ParserMigration
> ----------------------------------------------------
> In October 2016, at the parsing team offsite, Kunal ([[User:Legoktm (WMF)]])
> dusted off the stalled wikitext linting project [11] and (with the help
> from a bunch of people on the Parsoid, db/security/code review areas) built
> the Linter extension that surfaces wikitext errors that Parsoid knows about
> to let editors fix them.
>
> Earlier this year, we decided to use Linter in service of Tidy replacement.
> Based on our earlier testing results, we have added a set of high-priority
> linter categories that identifies specific wikitext markup patterns on wiki
> pages that need to be fixed [12].
>
> Separately, Tim built the ParserMigration extension to let editors evaluate
> their fixes to pages [13]. You can enable this in your editing preferences
> or replace '&action=edit' in your url bar with
> '&action=parsermigration-edit'.
>
> 7. What editors have to do
> --------------------------
> The part that you have all been waiting for!
>
> Please see 'What editors will need to do' section on the Tidy replacement
> FAQ [1]. We have added simplified instructions, so that even community
> members who do not consider themselves "techies" can still learn about ways
> to fix pages. We'll keep that section up to date based on feedback and
> questions. But since it is a wiki, please also edit and tweak as required
> to make the text useful for yourselves! This is a first call for fixes and
> it is about the problems defined as "high priority". We'll issue other
> calls in the future for any other necessary Tidy fixups.
>
> Caveats:
>
> * As noted on that page, the linter categories don't cover all the possible
>   sources of rendering differences. For example, there is still T157418
>   [14] left to address. For those who have an opinion about this, please
>   chime in on that task. We are still evaluating the best solution for this
>   without adding more cruft to wikitext behavior or kicking the cleanup can
>   down the road.
>
> * As the issues in the identified linter categories are fixed, we might be
>   better able to isolate other issues that need addressing.
>
> 8. So, when will Tidy actually be replaced?
> -------------------------------------------
> We really would like to get Tidy removed from the cluster latest by June
> 2018 (or sooner if possible), and your assistance and prompt attention to
> these markup issues would be very helpful. We will do this in a phased
> manner on different wikis rather than all at once on all wikis.
>
> We really want to do this as smoothly as possible without disrupting the
> work of editors or affecting the rendering of the large corpus of pages on
> the various wikis. As you might have gathered from the text above, we have
> built and leveraged a wide variety of tools to assist with this.
>
> 9. Monitoring progress
> ----------------------
> In order to monitor progress, we plan to do a weekly (or some such periodic
> frequency) test run that compares the rendering of pages with Tidy and with
> RemexHTML on a large sample of pages (in the 50K range) from a large subset
> of Wikimedia wikis (~50 or so). This will give us a pulse of how fixups are
> going, and when we might be able to flip the switch on different wikis.
>
> Subramanya (Subbu) Sastry
> Parsing Team.
>
> References
> ----------
> 0. https://www.mediawiki.org/wiki/Talk:Parsing/Replacing_Tidy/FAQ
> 1. https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/FAQ#What_will_editors_need_to_do.3F
> 2. https://en.wikipedia.org/wiki/HTML_Tidy
> 3. https://phabricator.wikimedia.org/tag/tidy/
> 4. https://phabricator.wikimedia.org/T89331
> 5. https://github.com/wikimedia/mediawiki-libs-RemexHtml
> 6. https://phabricator.wikimedia.org/T120345
> 7. https://github.com/wikimedia/integration-uprightdiff
> 8. https://github.com/wikimedia/integration-visualdiff
> 9. https://github.com/wikimedia/mediawiki-services-parsoid-testreduce
> 10. https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy
> 11. https://phabricator.wikimedia.org/T48705
> 12. https://www.mediawiki.org/wiki/Help:Extension:Linter#Goal:_Replacing_Tidy
> 13. https://www.mediawiki.org/wiki/Help:Extension:Linter#Verifying_fixes_for_these_lint_categories
> 14. https://phabricator.wikimedia.org/T157418
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
