Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-20 Thread Bartosz Dziewoński
On Thu, 20 Aug 2015 02:32:12 +0200, Tim Starling tstarl...@wikimedia.org  
wrote:



On 20/08/15 01:21, Erwin Dokter wrote:

I mentioned this once before:

http://www.htacg.org/tidy-html5/

While Tidy died in 2008, this fork lives on and is HTML5 aware. That
will at least solve a lot of problems *caused* by Tidy, such as not
allowing block elements inside inline elements (which is allowed in
HTML5).


HTML 5 has not significantly relaxed the rules about block elements
inside inline elements. The terminology has changed: now instead of
inline elements we have phrasing content and instead of block
elements we have flow content. You're still not allowed to put a
<div> inside a <span>, because <span> is phrasing content and <div>
isn't.


Erwin might be referring to T73962 (<a><div>Foo</div></a> is changed to
<a></a><div>Foo</div> by Tidy), which is related to a change in semantics
in HTML 5 (previously <a> was an inline element, now it is transparent).


[T73962] https://phabricator.wikimedia.org/T73962

--
Bartosz Dziewoński

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-19 Thread Brandon Black
On Wed, Aug 19, 2015 at 1:22 PM, MZMcBride z...@mzmcbride.com wrote:
 Bartosz wrote:
 We really do need this feature. Not anything else that Tidy does, most
of its behavior is actually damaging, but we need to match the open and
close tags to prevent the interface from getting jumbled.

 My reading of this thread is that this is the consensus view. The problem,
 as I see it, is that Tidy has been deployed long enough that some users
 are also relying on all of its other bad behaviors. It seems to me that a
 replacement for Tidy either has to reimplement all of its unwanted
 behaviors to avoid breakage with current wikitext or it has to break an
 unknown amount of current wikitext.

My $0.02 from the peanut gallery: If we fixed up the bulk of the most
common cases we can (where the bad HTML is not the result of an edit
error), could we keep a Tidy/HTML5 type of thing around, but move it
to edit validation rather than render output processing?  We could
start by leaving the current output-side code alone, and warning (to
the user as a minor info blurb on edit submission, and in our logs)
about edits that fail validation, so that we can get some idea of the
scope and causes of the problem, fix what we can, and then evaluate
whether we can eventually start flat-out rejecting the minority of
edits that fail validation and then eventually remove the tidy on the
output side.  That ignores the whole problem of existing bad html
already in the DB, of course, but that could probably be fixed with a
one-time job...
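
The warn-first validation step proposed above could look roughly like the following. This is only a sketch and not anything that exists in MediaWiki: `TagBalanceChecker` and `validate_edit` are hypothetical names, the check runs on an HTML string rather than on wikitext, and it looks only at tag balance, using Python's standard `html.parser`. It collects warnings and never rewrites or rejects the input, which matches the first phase of the proposal.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Collects warnings about unclosed or stray tags; never rewrites input."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []   # stack of currently open tag names
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if tag not in self.open_tags:
            self.warnings.append(f"stray closing tag </{tag}>")
            return
        # Everything opened after the matching tag was left unclosed.
        while self.open_tags:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.warnings.append(f"<{top}> not closed before </{tag}>")


def validate_edit(html_text):
    """Return a list of human-readable warnings for one submitted edit."""
    checker = TagBalanceChecker()
    checker.feed(html_text)
    checker.close()
    for tag in reversed(checker.open_tags):
        checker.warnings.append(f"<{tag}> never closed")
    return checker.warnings
```

A caller would show the returned list as the "minor info blurb" and write it to a log. Moving from warning to rejecting would then be a policy change in the caller, not in the checker.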


Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-19 Thread Ricordisamoa

Il 19/08/2015 15:46, Brandon Black ha scritto:

On Wed, Aug 19, 2015 at 1:22 PM, MZMcBride z...@mzmcbride.com wrote:

Bartosz wrote:

We really do need this feature. Not anything else that Tidy does, most
of its behavior is actually damaging, but we need to match the open and
close tags to prevent the interface from getting jumbled.

My reading of this thread is that this is the consensus view. The problem,
as I see it, is that Tidy has been deployed long enough that some users
are also relying on all of its other bad behaviors. It seems to me that a
replacement for Tidy either has to reimplement all of its unwanted
behaviors to avoid breakage with current wikitext or it has to break an
unknown amount of current wikitext.

My $0.02 from the peanut gallery: If we fixed up the bulk of the most
common cases we can (where the bad HTML is not the result of an edit
error), could we keep a Tidy/HTML5 type of thing around, but move it
to edit validation rather than render output processing?  We could
start by leaving the current output-side code alone, and warning (to
the user as a minor info blurb on edit submission, and in our logs)
about edits that fail validation, so that we can get some idea of the
scope and causes of the problem, fix what we can, and then evaluate
whether we can eventually start flat-out rejecting the minority of
edits that fail validation and then eventually remove the tidy on the
output side.  That ignores the whole problem of existing bad html
already in the DB, of course, but that could probably be fixed with a
one-time job...


Keep in mind that a lot of templates intentionally consist of 'broken' 
HTML that is then 'put back together' in articles...



Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-19 Thread Tim Starling
On 13/08/15 15:43, MZMcBride wrote:
 Or could we replace Tidy with nothing? Relying on the principle of
 garbage in, garbage out seems reasonable in some ways. And modern
 browsers are fairly adept at handling moderately bad HTML.

The HTML 5 spec makes a distinction between valid, balanced HTML and
error recovery algorithms. Browsers are basically the only clients
able to handle moderately bad HTML, and as I've previously said in
discussions of HTML 5 output, I don't think it is acceptable to screw
over all non-browser clients by sending output that relies on obscure
details of the HTML 5 spec. I think XHTML or something close to it is
an appropriate machine-readable output format.

Have you looked at my survey on the bug? Compliant HTML 5 parsers are
10-30k source lines and are in pretty short supply.

Wikitext is not meant to be easily machine-readable, it is meant to be
easily human-writable. Unbalanced tags in HTML are errors, but in
wikitext they are allowed. This is a design choice. Most humans don't
really care about the spec, they just want the machine to figure out
what they meant.

And, as several others have noted, you can't just disable Tidy, since
the effects of unclosed tags are not confined to the content area, and
there is a large amount of existing content that depends on it. I have
seen the effects of Tidy being accidentally disabled on the English
Wikipedia, it is not pleasant.

Am I correct in saying that MZMcBride is the only person in this
thread in favour of the idea of getting rid of HTML cleanup?


By the way, you can see my work in progress on an HTML reserializer
web service in the mediawiki/services/html5depurate project on Gerrit:

https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/services/html5depurate+branch:master,n,z

-- Tim Starling



Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-19 Thread MZMcBride
Tim Starling wrote:
The HTML 5 spec makes a distinction between valid, balanced HTML and
error recovery algorithms. Browsers are basically the only clients
able to handle moderately bad HTML, and as I've previously said in
discussions of HTML 5 output, I don't think it is acceptable to screw
over all non-browser clients by sending output that relies on obscure
details of the HTML 5 spec. I think XHTML or something close to it is
an appropriate machine-readable output format.

Machine-readable output format? Are you suggesting that there would be a
change from the current policy of telling everyone who screen-scrapes HTML
not to ever do it and to instead use api.php? Otherwise, given that the
majority of our actual traffic comes from actual browsers, as I understand
it, I'm not sure I see which clients you're trying to serve.

And, as several others have noted, you can't just disable Tidy, since
the effects of unclosed tags are not confined to the content area, and
there is a large amount of existing content that depends on it. I have
seen the effects of Tidy being accidentally disabled on the English
Wikipedia, it is not pleasant.

Am I correct in saying that MZMcBride is the only person in this
thread in favour of the idea of getting rid of HTML cleanup?

I think it depends what you mean by HTML cleanup. Are you referring only
to fixing mismatched HTML elements or are you also referring to
reimplementing all of the other behavior that Tidy brings in?

Bartosz wrote:
 We really do need this feature. Not anything else that Tidy does, most
of its behavior is actually damaging, but we need to match the open and
close tags to prevent the interface from getting jumbled.

My reading of this thread is that this is the consensus view. The problem,
as I see it, is that Tidy has been deployed long enough that some users
are also relying on all of its other bad behaviors. It seems to me that a
replacement for Tidy either has to reimplement all of its unwanted
behaviors to avoid breakage with current wikitext or it has to break an
unknown amount of current wikitext.

MZMcBride




Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-19 Thread Subramanya Sastry

On 08/19/2015 08:22 AM, MZMcBride wrote:

And, as several others have noted, you can't just disable Tidy, since
the effects of unclosed tags are not confined to the content area, and
there is a large amount of existing content that depends on it. I have
seen the effects of Tidy being accidentally disabled on the English
Wikipedia, it is not pleasant.

Am I correct in saying that MZMcBride is the only person in this
thread in favour of the idea of getting rid of HTML cleanup?

I think it depends what you mean by HTML cleanup. Are you referring only
to fixing mismatched HTML elements or are you also referring to
reimplementing all of the other behavior that Tidy brings in?

Bartosz wrote:

We really do need this feature. Not anything else that Tidy does, most
of its behavior is actually damaging, but we need to match the open and
close tags to prevent the interface from getting jumbled.

My reading of this thread is that this is the consensus view. The problem,
as I see it, is that Tidy has been deployed long enough that some users
are also relying on all of its other bad behaviors. It seems to me that a
replacement for Tidy either has to reimplement all of its unwanted
behaviors to avoid breakage with current wikitext or it has to break an
unknown amount of current wikitext.
In response to both these queries, see this snippet from my earlier post 
on this thread ( 
https://lists.wikimedia.org/pipermail/wikitech-l/2015-August/082806.html )


   Even replacing it with a HTML5 parser (as per the current
plan) is not entirely straightforward simply because of all the other
unrelated-to-html5-semantics behavior. Part of the task of replacing
Tidy is to figure out all the ways those pages might break and the best
way to handle that breakage.

Also see https://phabricator.wikimedia.org/T89331#1499979 about how we 
might go about evaluating this.


So, we aren't saying we'll implement those Tidy behaviors here. Part of 
the solution might very well be to break some of that Tidy behavior and 
have the pages be fixed up (bots, manually, however). In any case, the 
first step is to understand those impacts.


Subbu.


Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-19 Thread Erwin Dokter

I mentioned this once before:

http://www.htacg.org/tidy-html5/

While Tidy died in 2008, this fork lives on and is HTML5 aware. That 
will at least solve a lot of problems *caused* by Tidy, such as not 
allowing block elements inside inline elements (which is allowed in HTML5).


Can we at least evaluate if this is a suitable interim solution?

Regards,
--
Erwin Dokter



Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-19 Thread Tim Starling
On 20/08/15 01:21, Erwin Dokter wrote:
 I mentioned this once before:
 
 http://www.htacg.org/tidy-html5/
 
 While Tidy died in 2008, this fork lives on and is HTML5 aware. That
 will at least solve a lot of problems *caused* by Tidy, such as not
 allowing block elements inside inline elements (which is allowed in
 HTML5).
 
 Can we at least evaluate if this is a suitable interim solution?

That's not a solution to the problems that we are trying to solve.

As I said in my original post, my number one problem with Tidy is that
it changes. So I am very happy that it is not in active development.
Switching to a fork that is actively maintained would be much worse.
It would be like the switch from Tidy to the proposed HTML
reserializer web service, except that the pain would be repeated every
time we upgrade our Linux distribution.

The other problem with Tidy is that it is poorly specified and has
only one implementation. Switching to a fork of it doesn't improve the
situation.

HTML 5 has not significantly relaxed the rules about block elements
inside inline elements. The terminology has changed: now instead of
inline elements we have phrasing content and instead of block
elements we have flow content. You're still not allowed to put a
<div> inside a <span>, because <span> is phrasing content and <div> isn't.

The children column here has a summary:

http://www.w3.org/TR/html5/index.html#elements-1
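
To make the phrasing/flow rule concrete, here is a small sketch of a nesting check. It is a deliberate simplification under stated assumptions: the element sets are tiny hand-picked subsets of what the spec's element index lists, `allowed` is a hypothetical helper, unknown parents are assumed to accept flow content, and the extra restrictions on <a> (such as no interactive descendants) are ignored. The "transparent" entry reflects the HTML 5 rule that <a> takes on the content model of its parent.

```python
# Hand-picked subset of elements that count as phrasing content (assumption:
# a real checker would cover every element in the spec's element index).
PHRASING_ELEMENTS = {"a", "b", "em", "i", "small", "span", "strong"}

# What each parent element accepts as children.
CONTENT_MODEL = {
    "span": "phrasing",
    "b": "phrasing",
    "i": "phrasing",
    "small": "phrasing",
    "p": "phrasing",
    "div": "flow",
    "blockquote": "flow",
    "a": "transparent",  # inherits the content model of its own parent
}


def allowed(ancestors, child):
    """Return True if `child` may appear under `ancestors`.

    `ancestors` lists element names from the outermost element down to the
    direct parent.
    """
    for name in reversed(ancestors):
        model = CONTENT_MODEL.get(name, "flow")  # unknown parents: assume flow
        if model == "transparent":
            continue  # defer the decision to the next ancestor out
        if model == "phrasing":
            return child in PHRASING_ELEMENTS
        return True  # a flow parent accepts flow and phrasing children alike
    return True  # only transparent ancestors (or none): treat as flow context
```

With these tables, `allowed(["span"], "div")` is rejected, while `allowed(["div", "a"], "div")` is accepted because the <a> defers to the enclosing <div>.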

-- Tim Starling



Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-18 Thread David Gerard
On 18 August 2015 at 04:15, MZMcBride z...@mzmcbride.com wrote:
 Brian Wolff wrote:

I don't know about that. Viz editor is targeting ordinary tasks. It's the
complex things that mess stuff up.

 In most contexts, solving the ordinary/common cases is a pretty big win.


Or when it turns a complex task into a simple one, e.g. table editing
(one click to remove a column).


- d.


Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-18 Thread MZMcBride
Subramanya Sastry wrote:
* Unclosed HTML tags (very common)
* Misnested tags
* Misnesting of tags (ex: links in links .. [http://foo.bar this is a
[[foobar]] company])
* Fostered content in tables
(<table>this-content-will-show-up-outside-the-table<tr><td></td></tr></table>)
... this has been one of the biggest sources of complexity inside Parsoid
... in combination with templates, this is nasty.
* Other ways in which HTML5 content model might be violated. (ex:
<small>\n*a\n*b\n</small>)
* Look at the parser tests file and see all the tests we've added with
annotations that say php parser relies on tidy

I don't see why we would want to incur the maintenance cost of continuing
to support any of these bad inputs. I think we should look to deprecate,
not replace, Tidy. This is a case of the cure being worse than the disease.

So, you cannot just rip out Tidy and not replace it with something in
its place. Even replacing it with a HTML5 parser (as per the current
plan) is not entirely straightforward simply because of all the other
unrelated-to-html5-semantics behavior. Part of the task of replacing
Tidy is to figure out all the ways those pages might break and the best
way to handle that breakage.

We shouldn't rip out Tidy immediately, we should implement a means of
disabling Tidy on a per-page or per-user basis and allow the wiki process
to correct bad markup over time. Cunningham's Law applies here.

MZMcBride




Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-18 Thread Derk-Jan Hartman
If we want to do away with Tidy, we will have to make all editors perfect
HTML authors, or we risk them damaging pages so badly that they potentially
can't reach the edit button anymore. As far as I'm concerned, this is what
Tidy primarily does: it isolates errors in the content so that they
cannot influence the rest of the website's interface. And yes, I do
regularly see such problems in MediaWiki instances that do not run Tidy.

Rule one of security: always have multiple layers of defense. Yes, we should
reduce the number of problems and make them more visible, but that doesn't
mean we don't still need a corrective method as a fallback.

DJ

On Tue, Aug 18, 2015 at 3:04 PM, David Gerard dger...@gmail.com wrote:

 On 18 August 2015 at 04:15, MZMcBride z...@mzmcbride.com wrote:
  Brian Wolff wrote:

 I don't know about that. Viz editor is targeting ordinary tasks. It's the
 complex things that mess stuff up.

  In most contexts, solving the ordinary/common cases is a pretty big win.


 Or when it turns a complex task into a simple one, e.g. table editing
 (one click to remove a column).


 - d.


Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-18 Thread Subramanya Sastry

On 08/18/2015 07:58 AM, MZMcBride wrote:

Subramanya Sastry wrote:

* Unclosed HTML tags (very common)
* Misnested tags
* Misnesting of tags (ex: links in links .. [http://foo.bar this is a
[[foobar]] company])
* Fostered content in tables
(<table>this-content-will-show-up-outside-the-table<tr><td></td></tr></table>)
... this has been one of the biggest sources of complexity inside Parsoid
... in combination with templates, this is nasty.
* Other ways in which HTML5 content model might be violated. (ex:
<small>\n*a\n*b\n</small>)
* Look at the parser tests file and see all the tests we've added with
annotations that say php parser relies on tidy

I don't see why we would want to incur the maintenance cost of continuing
to support any of these bad inputs. I think we should look to deprecate,
not replace, Tidy. This is a case of the cure being worse than the disease.


Are you suggesting that you get rid of wikitext editing? If not, you 
cannot assume editors are going to write perfect markup.


What is needed is a way to define DOM scopes in wikitext and enforce 
well-formedness within scopes. So, for example, template output can be 
considered a DOM scope (either opt-in or opt-out). If we felt bold, we 
can define a list to be a DOM scope .. or a table to be a DOM scope ... 
or an image caption to be a DOM scope, and so on.


Rather than expect editors to write perfect markup, we should be
thinking about sane semantics for them, such as scoping that delimits the
effects of broken markup. With proper semantics, it is easier to reason about
markup and not rely on whimsical behavior of whatever tool we used 
yesterday or use today or might use tomorrow.
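
As an illustration of the scoping idea, the sketch below treats each fragment (say, one template's output) as its own scope and checks it for well-formedness in isolation. Everything here is an assumption made for illustration: `is_well_formed` and `render_scopes` are hypothetical names, the check uses Python's XML parser (so HTML-only constructs such as a bare <br> or &nbsp; would be flagged), and showing a broken scope as escaped source is just one possible policy. The thread does not say how enforcement would actually work.

```python
import html
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """True if the fragment parses as XML once wrapped in a single root."""
    try:
        ET.fromstring(f"<scope>{fragment}</scope>")
        return True
    except ET.ParseError:
        return False


def render_scopes(scopes):
    """Render each scope independently so a broken one cannot affect others."""
    parts = []
    for fragment in scopes:
        if is_well_formed(fragment):
            parts.append(fragment)
        else:
            # Contain the damage: show the broken scope as escaped source.
            parts.append(f"<pre>{html.escape(fragment)}</pre>")
    return "".join(parts)
```

The point of the design is that an unclosed <div> in the second scope stays inside that scope; the scopes before and after it render as written.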


We are working towards this kind of scoping semantics, and the first
step on the way is to get a HTML5 treebuilder / parser in place.


Subbu.


Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-18 Thread Mr. Stradivarius
On Tue, Aug 18, 2015 at 11:48 PM, Derk-Jan Hartman 
d.j.hartman+wmf...@gmail.com wrote:

 If we want to do away with Tidy, we will have to make all editors perfect
 html authors


In my experience, mismatched tags are quite often used on purpose. For
example, Cyberpower678 has two unmatched div tags at the end of his
StandardLayout template
https://en.wikipedia.org/wiki/User:Cyberpower678/StandardLayout, used to
put a shaded border round the posts on his talk page
https://en.wikipedia.org/wiki/User_talk:Cyberpower678. There are no
corresponding closing div tags at the end of the talk page, as they would
be moved by the talk page archive bot, and Tidy takes care of the invalid
HTML anyway.

Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-18 Thread Bartosz Dziewoński

On Tue, 18 Aug 2015 05:15:05 +0200, MZMcBride z...@mzmcbride.com wrote:

The only cited example of real breakage so far has been mismatched divs.
How often are you or anyone else adding divs to pages? In my experience,
most users rely on MediaWiki templates for any kind of complex markup.

Echoing my initial reply in this thread, I still don't really understand
what behaviors from Tidy we want to keep. I've been following
https://phabricator.wikimedia.org/T89331 a bit and it also hasn't helped
answer this question.


Mismatched tags of any kind. An opening <foo> or closing </foo> tag without a
pair can wreak havoc on the entire page, including the interface.


I recall reports of unclosed <small> or <b> reducing the font size of or
bolding the entire page. I can't find that one, but here's a small
collection of bugs caused by Tidy unintentionally not running in various  
contexts: T27888 T29889 T40273 T44016 T60042 T60439.


You could easily engineer this to hide the tabs if you were malicious  
(making it impossible for casual users to edit the page, say, to fix the  
broken markup), and it might even be doable by accident.


We really do need this feature. Not anything else that Tidy does, most of  
its behavior is actually damaging, but we need to match the open and close  
tags to prevent the interface from getting jumbled.
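
The one behavior asked for here, matching open and close tags so that page content cannot leak into the interface, can be sketched in a few lines. This is an illustration only, under stated assumptions: `Balancer` and `balance` are hypothetical names, the code uses Python's standard `html.parser` and not an HTML5 tree builder, and it implements none of the spec's error recovery rules (no adoption agency algorithm, no table fostering). It drops stray closing tags and closes whatever is still open.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID_ELEMENTS = {"br", "hr", "img", "wbr"}


class Balancer(HTMLParser):
    """Re-emits the input with every opened tag closed exactly once."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.stack = []  # names of elements that are currently open

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened since the matching start tag.
        while True:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def balance(fragment):
    """Return `fragment` with unclosed tags closed and stray closers removed."""
    balancer = Balancer()
    balancer.feed(fragment)
    balancer.close()
    while balancer.stack:
        balancer.out.append(f"</{balancer.stack.pop()}>")
    return "".join(balancer.out)
```

Applied to the content area alone, an unclosed <small> is closed at the end of the content, so it can no longer shrink the rest of the page.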


--
Bartosz Dziewoński


Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-17 Thread MZMcBride
Brian Wolff wrote:
I don't know about that. Viz editor is targeting ordinary tasks. It's the
complex things that mess stuff up.

In most contexts, solving the ordinary/common cases is a pretty big win.

Failing fast and loud is good in lots of contexts. I don't think wiki
editing is one of them.

The only cited example of real breakage so far has been mismatched divs.
How often are you or anyone else adding divs to pages? In my experience,
most users rely on MediaWiki templates for any kind of complex markup.

Echoing my initial reply in this thread, I still don't really understand
what behaviors from Tidy we want to keep. I've been following
https://phabricator.wikimedia.org/T89331 a bit and it also hasn't helped
answer this question.

Afaik, anchors are disallowed because spammers commonly insert them. It's
trivial to sanitize and allow them if we so desired.

Spammers can trivially insert anchors (links). Additional wrapper markup
isn't even needed; we automatically render hyperlinks if a string has a
prefix that looks like it might be a URL. In any case, this is the subject
of https://phabricator.wikimedia.org/T35886.

MZMcBride




Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-17 Thread Tobias Oetterer
Hey

 The only cited example of real breakage so far has been mismatched divs.
 How often are you or anyone else adding divs to pages? In my experience, 
 most users rely on MediaWiki templates for any kind of complex markup.

I don't know how up to date this manual page is, but mediawiki.org explicitly 
states that templates copied from wikipedias may need to have tidy activated 
in order to work properly [1]. The old Template:Infobox (before Scribunto) 
sure did...


[1]: 
https://www.mediawiki.org/wiki/Manual:Using_content_from_Wikipedia#HTMLTidy

Regards,
Tobias Oetterer



Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-17 Thread Subramanya Sastry

On 08/17/2015 10:15 PM, MZMcBride wrote:

Failing fast and loud is good in lots of contexts. I don't think wiki
editing is one of them.

The only cited example of real breakage so far has been mismatched divs.
How often are you or anyone else adding divs to pages? In my experience,
most users rely on MediaWiki templates for any kind of complex markup.

Echoing my initial reply in this thread, I still don't really understand
what behaviors from Tidy we want to keep. I've been following
https://phabricator.wikimedia.org/T89331 a bit and it also hasn't helped
answer this question.


Wikitext is string-based and generates an HTML string, which in the general
case need not be well-formed HTML. There is a lot of broken wikitext
out there, and if you remove Tidy and don't introduce an HTML5-parser-based
balancer, you are going to see a lot of breakage.


* Unclosed HTML tags (very common)
* Misnested tags
* Misnesting of tags (ex: links in links .. [http://foo.bar this is a 
[[foobar]] company])
* Fostered content in tables 
(<table>this-content-will-show-up-outside-the-table<tr><td></td></tr></table>)
... this has been one of the biggest sources of complexity inside Parsoid
... in combination with templates, this is nasty.
* Other ways in which HTML5 content model might be violated. (ex:
<small>\n*a\n*b\n</small>)
* Look at the parser tests file and see all the tests we've added with 
annotations that say php parser relies on tidy


[[ Tangent: We have a linting option in Parsoid that we can turn on in 
production that can dump information about all these broken forms of 
wikitext (we have this information because we have to break the wikitext 
in the same ways when we convert html to wikitext). We haven't turned it 
on in production yet because we haven't yet had the time to hook this 
into project wikicheck .. we had initial conversations, but we couldn't 
follow up on our end. ]]


Besides these, there is also other unrelated-to-html5-semantics behavior 
that wikis have come to rely on.
* Stripping of empty tags -- correct page rendering relies on the fact
that Tidy strips empty elements from HTML. We had to explicitly add this 
behavior to Parsoid so pages render identically. We could rip this out 
as long as all those templates are fixed up. The infobox on itwiki:Luna 
relies on this, to give you a specific example.

* Some behaviors found in https://phabricator.wikimedia.org/T4542
* I am sure there are a bunch of other behaviors that I am missing / 
don't know about.
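
The empty-element stripping in the first item of the list above can be sketched as follows. This is not Tidy's or Parsoid's code. `strip_empty` is a hypothetical helper that works on an already well-formed tree via Python's `xml.etree.ElementTree`, and the rule it applies (no child elements and no non-whitespace text, void elements excepted) is an assumption, since the thread does not spell out exactly which elements Tidy removes.

```python
import xml.etree.ElementTree as ET

# Void elements are empty by definition and must be kept (partial list).
VOID_ELEMENTS = {"br", "hr", "img"}


def strip_empty(parent):
    """Remove empty descendants of `parent` bottom-up, keeping trailing text."""
    previous = None  # last sibling that was kept
    for child in list(parent):
        strip_empty(child)  # children first, so emptied parents go too
        is_empty = len(child) == 0 and not (child.text or "").strip()
        if is_empty and child.tag not in VOID_ELEMENTS:
            # Text after the removed element must not be lost with it.
            tail = child.tail or ""
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
        else:
            previous = child
    return parent
```

Because children are processed before their parent is tested, an element that contains nothing but empty elements is removed as well. A template that relies on this would render differently once the stripping is gone.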


So, you cannot just rip out Tidy and not replace it with something in 
its place. Even replacing it with a HTML5 parser (as per the current 
plan) is not entirely straightforward simply because of all the other 
unrelated-to-html5-semantics behavior. Part of the task of replacing 
Tidy is to figure out all the ways those pages might break and the best 
way to handle that breakage.


Going forward, we are thinking about how to enforce stricter constraints
on what templates (and extensions) can produce so that impacts from broken
wikitext are contained. That will give you some of what you are asking
(fail fast, but in a different form). That requires a functioning 
html5 treebuilder / parser to be in place which is what this RFC is about.


Subbu.


Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-15 Thread Brian Wolff
On Saturday, August 15, 2015, MZMcBride z...@mzmcbride.com wrote:
 Robert Rohde wrote:
Some years back I was importing a large number of complex templates to a
wiki that didn't have tidy enabled.  The results were nothing short of
horrendous in a substantial number of cases.  Wiki authors will generally
stop worrying about their code as long as the results look right.  For
good or ill, tidy does a remarkable job of localizing unclosed tags, and
often that is enough to effectively fix the appearance of broken HTML
syntax so it doesn't spill over into other sections.  Without Tidy (or
its equivalent) there will be a lot of template garbage that needs to be
repaired.

 As we get saner input mechanisms (CodeEditor, VisualEditor, ScoreEditor,
 etc.), we'll likely see a reduction in direct HTML editing, which seems to
 be what most often results in introducing layout-disrupting invalid input.

I don't know about that. Viz editor is targeting ordinary tasks. It's the
complex things that mess stuff up.

The garbage in - garbage out approach might seem appealing in principle,
but any transition to such a condition is going to dredge up a lot of
malformed HTML code created by wiki editors that we've been hiding for
many years.  If one is going to replace Tidy with something substantially
different in execution, I would suggest that one needs a significant test
suite of complex pages in order to judge how bad the collateral damage is
likely to be, and ideally some set of tools to help editors fix it.

 I think dredging up bad input in order to fix it is appropriate. A
 transition period could include the ability to temporarily render a page
 without Tidy enabled to see what issues present themselves. As I said
 previously, browsers are fairly resilient to moderately bad input, but
 even the really bad code should probably be properly addressed via the
 wiki process instead of being glossed over with magical fixes and
 replacements in the form of Tidy.

 In addition to following the garbage principle, we would also be following
 the idea of failing fast and loudly, if the layout gets borked by a missing
 tag, for example.

Failing fast and loud is good in lots of contexts. I don't think wiki
editing is one of them.

 (In continuing to think about this problem generally and how other
 sites/platforms have solved or mitigated it, it's amusing to me that we
 allow div, span, and inline styling and arbitrary attributes (both of
 which require separate sanitization), and yet we continue to disallow
 rendering of the anchor element.)


Afaik, anchors are disallowed because spammers commonly insert them. It's
trivial to sanitize and allow them if we so desired.

--
bawolff



Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-15 Thread MZMcBride
Robert Rohde wrote:
Some years back I was importing a large number of complex templates to a
wiki that didn't have tidy enabled.  The results were nothing short of
horrendous in a substantial number of cases.  Wiki authors will generally
stop worrying about their code as long as the results look right.  For
good or ill, tidy does a remarkable job of localizing unclosed tags, and
often that is enough to effectively fix the appearance of broken HTML
syntax so it doesn't spill over into other sections.  Without Tidy (or
its equivalent) there will be a lot of template garbage that needs to be
repaired.

As we get saner input mechanisms (CodeEditor, VisualEditor, ScoreEditor,
etc.), we'll likely see a reduction in direct HTML editing, which seems to
be what most often results in introducing layout-disrupting invalid input.

The garbage in - garbage out approach might seem appealing in principle,
but any transition to such a condition is going to dredge up a lot of
malformed HTML code created by wiki editors that we've been hiding for
many years.  If one is going to replace Tidy with something substantially
different in execution, I would suggest that one needs a significant test
suite of complex pages in order to judge how bad the collateral damage is
likely to be, and ideally some set of tools to help editors fix it.

I think dredging up bad input in order to fix it is appropriate. A
transition period could include the ability to temporarily render a page
without Tidy enabled to see what issues present themselves. As I said
previously, browsers are fairly resilient to moderately bad input, but
even the really bad code should probably be properly addressed via the
wiki process instead of being glossed over with magical fixes and
replacements in the form of Tidy.

In addition to following the garbage principle, we would also be following
the idea of failing fast and loudly, if the layout gets borked by a missing
tag, for example.

(In continuing to think about this problem generally and how other
sites/platforms have solved or mitigated it, it's amusing to me that we
allow div, span, and inline styling and arbitrary attributes (both of
which require separate sanitization), and yet we continue to disallow
rendering of the anchor element.)

MZMcBride




Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-13 Thread Robert Rohde
Some years back I was importing a large number of complex templates to a
wiki that didn't have tidy enabled.  The results were nothing short of
horrendous in a substantial number of cases.  Wiki authors will generally
stop worrying about their code as long as the results look right.  For good
or ill, tidy does a remarkable job of localizing unclosed tags, and often
that is enough to effectively fix the appearance of broken HTML syntax so
it doesn't spill over into other sections.  Without Tidy (or its
equivalent) there will be a lot of template garbage that needs to be
repaired.

The garbage in - garbage out approach might seem appealing in principle,
but any transition to such a condition is going to dredge up a lot of
malformed HTML code created by wiki editors that we've been hiding for many
years.  If one is going to replace Tidy with something substantially
different in execution, I would suggest that one needs a significant test
suite of complex pages in order to judge how bad the collateral damage is
likely to be, and ideally some set of tools to help editors fix it.

-Robert Rohde


Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-12 Thread MZMcBride
Tim Starling wrote:
https://phabricator.wikimedia.org/T89331

Running the output of the MediaWiki parser through HTML Tidy always
seemed like a nasty hack. The effects on wikitext syntax are arbitrary
and change from version to version. When we upgrade our Linux
distribution, we sometimes see changes in the HTML generated by given
wikitext, which is not ideal.

[...]

We can get nearly the same effect in MediaWiki by replacing the Tidy
transformation stage with an HTML 5 parse followed by serialization of
the DOM back to HTML. This would stabilize wikitext syntax and resolve
several important syntax differences compared to Parsoid.

Related tasks:

* https://phabricator.wikimedia.org/T4542
* https://phabricator.wikimedia.org/T56617

It's not clear to me which behaviors from Tidy we want to keep. Looking at
the various bugs that Tidy has caused, it's apparent that there are a number
of behaviors we want to disable/avoid.

My understanding is that Tidy is not responsible for output sanitization
and it's not responsible for preprocessing or parsing. MediaWiki handles
all of that elsewhere. If Tidy is only needed for mismatched HTML
elements, we could possibly catch and disallow or gracefully handle that
specific use-case in MediaWiki. What other beneficial behavior of Tidy
would we need to replicate?

Or could we replace Tidy with nothing? Relying on the principle of
garbage in, garbage out seems reasonable in some ways. And modern
browsers are fairly adept at handling moderately bad HTML.

MZMcBride




Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-12 Thread Brian Wolff
On 8/12/15, MZMcBride z...@mzmcbride.com wrote:
 Tim Starling wrote:
https://phabricator.wikimedia.org/T89331

Running the output of the MediaWiki parser through HTML Tidy always
seemed like a nasty hack. The effects on wikitext syntax are arbitrary
and change from version to version. When we upgrade our Linux
distribution, we sometimes see changes in the HTML generated by given
wikitext, which is not ideal.

[...]

We can get nearly the same effect in MediaWiki by replacing the Tidy
transformation stage with an HTML 5 parse followed by serialization of
the DOM back to HTML. This would stabilize wikitext syntax and resolve
several important syntax differences compared to Parsoid.

 Related tasks:

 * https://phabricator.wikimedia.org/T4542
 * https://phabricator.wikimedia.org/T56617

 It's not clear to me which behaviors from Tidy we want to keep. Looking at
 the various bugs that Tidy has caused, it's apparent that there are a number
 of behaviors we want to disable/avoid.

 My understanding is that Tidy is not responsible for output sanitization
 and it's not responsible for preprocessing or parsing. MediaWiki handles
 all of that elsewhere. If Tidy is only needed for mismatched HTML
 elements, we could possibly catch and disallow or gracefully handle that
 specific use-case in MediaWiki. What other beneficial behavior of Tidy
 would we need to replicate?

 Or could we replace Tidy with nothing? Relying on the principle of
 garbage in, garbage out seems reasonable in some ways. And modern
 browsers are fairly adept at handling moderately bad HTML.

 MZMcBride



The main thing Tidy does (IMO) is ensure that failures from mismatched HTML
are localized. When somebody makes a mistake, it can cause the entire
skin to go whacko. We ideally want markup mistakes to affect only
the user-generated content (and preferably, only the area around
the mistake).

--bawolff
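
To make the point about localizing mistakes concrete, here is a minimal
JavaScript sketch. It is neither Tidy's algorithm nor the HTML 5 parse
algorithm; the balanceTags helper and the VOID_ELEMENTS list are hypothetical
names invented for illustration. It assumes the user content is handled as a
standalone fragment: anything still open at the end gets closed and stray end
tags are dropped, so a missing close tag cannot leak into the surrounding skin.

```javascript
// Illustration only: a naive tag balancer, NOT Tidy and NOT the HTML 5
// tree builder. All names here are hypothetical.
const VOID_ELEMENTS = new Set(['br', 'hr', 'img', 'input', 'wbr']);

function balanceTags(fragment) {
  const open = []; // stack of element names that are currently open
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;

  const output = fragment.replace(tagPattern, (tag, slash, rawName) => {
    const name = rawName.toLowerCase();
    if (VOID_ELEMENTS.has(name)) {
      return tag; // void elements never stay open
    }
    if (!slash) {
      open.push(name); // start tag: remember it
      return tag;
    }
    const index = open.lastIndexOf(name);
    if (index === -1) {
      return ''; // stray end tag with no matching start tag: drop it
    }
    // Close everything opened after the matching start tag, then the tag itself.
    let closers = '';
    while (open.length > index) {
      closers += '</' + open.pop() + '>';
    }
    return closers;
  });

  // Close whatever the author forgot to close, innermost first.
  let tail = '';
  while (open.length > 0) {
    tail += '</' + open.pop() + '>';
  }
  return output + tail;
}
```

A real replacement would need the full HTML 5 error-recovery rules (tables,
formatting elements, and so on); this only shows why balancing at the
fragment boundary keeps the damage near the mistake.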


Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-11 Thread Tim Starling
Language choice: Tidy is written in C. Note that I included shelling
out to Node.js as an option in my original post. The parser is not really
part of Parsoid; it's a JavaScript library that Parsoid uses. We would use
the same JavaScript library with a few lines of wrapper code.
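
As a rough idea of what such wrapper code could look like, here is a sketch
built around the domino call Gabriel quoted. The function name reserialize and
its optional parser argument are assumptions made for illustration, not
anything from the task; domino is loaded lazily so that another library
exposing createDocument could be swapped in.

```javascript
// Sketch only: `reserialize` and the injectable `parser` argument are
// hypothetical. The domino call itself is the one quoted in this thread.
function reserialize(input, parser) {
  // Fall back to domino when no parser object is supplied.
  const htmlParser = parser || require('domino');
  // Parse with the library's HTML 5 parser, then serialize the DOM back to HTML.
  return htmlParser.createDocument(input).outerHTML;
}
```

A shell-out from PHP would then pipe the parser output to a small Node.js
script that calls this function on its standard input and prints the result.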

-- Tim Starling

On 12/08/15 10:24, Trevor Parscal wrote:
 Interesting. What is the cause of the slower speed?
 
 - Trevor
 

Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-11 Thread Gabriel Wicke
On Tue, Aug 11, 2015 at 5:16 PM, Trevor Parscal tpars...@wikimedia.org
wrote:

 Is it possible to use part of the Parsoid code to do this?


It is possible to do this in Parsoid (or any node service) with this line:

 var sanerHTML = domino.createDocument(input).outerHTML;

However, performance is about 2x worse than current Tidy (116 ms for Tidy vs.
238 ms for domino on the Obama article), and about 4x slower than the fastest
option in our tests. The
task has a lot more benchmarks of various options.
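
For anyone wanting to reproduce this kind of comparison, a timing harness
could be sketched as below. This is not the benchmark used on the task;
medianMillis and slowdown are hypothetical helper names, and the only figures
used are the two quoted above.

```javascript
// Hypothetical timing helpers, not the benchmark code from the task.

// Run fn(input) `runs` times and return the median wall-clock time in ms.
function medianMillis(fn, input, runs) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    fn(input);
    const end = process.hrtime.bigint();
    samples.push(Number(end - start) / 1e6); // nanoseconds to milliseconds
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)];
}

// Ratio of a candidate's time to the baseline's time.
function slowdown(baselineMs, candidateMs) {
  return candidateMs / baselineMs;
}

// With the figures quoted above: 238 ms / 116 ms is roughly 2.05.
const ratio = slowdown(116, 238);
```

Taking the median over several runs is a design choice to damp JIT warm-up and
GC noise; a single run of a JavaScript parser can be misleading.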

Gabriel






-- 
Gabriel Wicke
Principal Engineer, Wikimedia Foundation

Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-11 Thread Trevor Parscal
Interesting. What is the cause of the slower speed?

- Trevor


Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-11 Thread Trevor Parscal
Is it possible to use part of the Parsoid code to do this?

- Trevor

On Tuesday, August 11, 2015, Tim Starling tstarl...@wikimedia.org wrote:

 I'm elevating this task of mine to RFC status:

 https://phabricator.wikimedia.org/T89331

 Running the output of the MediaWiki parser through HTML Tidy always
 seemed like a nasty hack. The effects on wikitext syntax are arbitrary
 and change from version to version. When we upgrade our Linux
 distribution, we sometimes see changes in the HTML generated by given
 wikitext, which is not ideal.

 Parsoid took a different approach. After token-level transformations,
 tokens are fed into the HTML 5 parse algorithm, a complex but
 well-specified algorithm which generates a DOM tree from quirky input
 text.

 http://www.w3.org/TR/html5/syntax.html

 We can get nearly the same effect in MediaWiki by replacing the Tidy
 transformation stage with an HTML 5 parse followed by serialization of
 the DOM back to HTML. This would stabilize wikitext syntax and resolve
 several important syntax differences compared to Parsoid.

 However:

 * I have not been able to find any PHP implementation of this
 algorithm. Masterminds and Ressio do not even attempt it. Electrolinux
 attempts it but does not implement the error recovery parts that are
 of interest to us.
 * Writing our own would be difficult.
 * Even if we did write it, it would probably be too slow.

 So the question is: what language should we use? Since this is the
 standard programmer troll question, please bring popcorn.

 The best implementation of this algorithm is in Java: the validator.nu
 parser is maintained by Mozilla, and has source translation to C++,
 which is used by Mozilla and could potentially be used for an HHVM
 extension.

 There is also a Rust port (also written by Mozilla), and notable
 implementations in JavaScript and Python.

 For WMF, a Java service would be quite easily done, and I have
 prototyped it already. An HHVM extension might also be possible. A
 non-service fallback for small installations might be Node.js or a
 compiled binary from Rust or C++.

 -- Tim Starling



Re: [Wikitech-l] RFC: Replace Tidy with HTML 5 parse/reserialize

2015-08-11 Thread Gabriel Wicke
On Tue, Aug 11, 2015 at 5:24 PM, Trevor Parscal tpars...@wikimedia.org
wrote:

 Interesting. What is the cause of the slower speed?


Mainly a pure-JS DOM implementation (domino) not being quite the same speed
as C or Rust with all optimizations turned on. The deltas are roughly in
line with language benchmarks like http://benchmarksgame.alioth.debian.org/.

Gabriel