Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;

2016-05-14 Thread Brian Wolff
On Saturday, May 14, 2016, Strainu  wrote:
> 2016-05-14 4:07 GMT+03:00 Legoktm :
>> Hi,
>>
>> On 05/02/2016 11:42 AM, Brian Wolff wrote:
>>> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
>>> appreciate everyone's feedback.
>>
>> Given the lack of objections here and on Gerrit, I went ahead and merged
>> it today.
>
> Can you please clarify if this change will have any effect on
> non-valid HTML in the Wikitext? I suppose no change will occur, since
> this was the default anyway, but I'd like a confirmation.
>
> Strainu
>

That is correct. Nothing will change about invalid HTML: if you have Tidy
enabled, the invalid HTML gets fixed; if you don't, it does not.

--
bawolff
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;

2016-05-14 Thread Antoine Musso
On 14/05/2016 at 03:07, Legoktm wrote:
> Hi,
> 
> On 05/02/2016 11:42 AM, Brian Wolff wrote:
>> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
>> appreciate everyone's feedback.
> 
> Given the lack of objections here and on Gerrit, I went ahead and merged
> it today.

Hello,

That sounds good. I would suggest applying it to REL1_27 as well.

-- 
Antoine "hashar" Musso



Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;

2016-05-14 Thread Strainu
2016-05-14 4:07 GMT+03:00 Legoktm :
> Hi,
>
> On 05/02/2016 11:42 AM, Brian Wolff wrote:
>> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
>> appreciate everyone's feedback.
>
> Given the lack of objections here and on Gerrit, I went ahead and merged
> it today.

Can you please clarify if this change will have any effect on
non-valid HTML in the Wikitext? I suppose no change will occur, since
this was the default anyway, but I'd like a confirmation.

Strainu


Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;

2016-05-13 Thread Legoktm
Hi,

On 05/02/2016 11:42 AM, Brian Wolff wrote:
> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
> appreciate everyone's feedback.

Given the lack of objections here and on Gerrit, I went ahead and merged
it today.

-- Legoktm


Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;

2016-05-03 Thread Brian Wolff
On Monday, May 2, 2016, Max Semenik  wrote:
> On Mon, May 2, 2016 at 3:04 PM, Brian Wolff  wrote:
>
>>

> At this point, I would say that everybody who screen-scrapes saw it coming
> and breaking them is a good thing as sometimes, lessons just have to be
> learned.
>

Personally, I don't think we should shy away from breaking screen scrapers
if we get something out of it, but in this case I don't see the benefit.
Breaking things because we can, without getting any benefit (or only trivial
benefits), seems rather pointless and kind of mean to those who do scrape.

--
bawolff

Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;

2016-05-03 Thread Gergo Tisza
On Tue, May 3, 2016 at 2:43 AM, Max Semenik  wrote:

> At this point, I would say that everybody who screen-scrapes saw it coming
> and breaking them is a good thing as sometimes, lessons just have to be
> learned.
>

There aren't many options other than content-scraping if you want to
transform Wikipedia articles into some semblance of structured data. We
even do it ourselves, for media metadata (and use an XML parser for it, as
PHP doesn't offer much in the way of parsing HTML5, so outputting
HTML5-style empty tags might break it - although IIRC there is a hack to
work around that as file pages can contain ill-formed HTML anyway).
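
The parsing concern above can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the code we actually run for media metadata; the helper names `parses_as_xml` and `html_tags` are made up for the example, and `<br>` is just an assumed sample tag.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records the name of every start tag the HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also reached for "<br />", because the default
        # handle_startendtag() delegates to handle_starttag().
        self.tags.append(tag)


def html_tags(markup):
    """Return the start tags found by a lenient HTML parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


xml_style = "<p>a<br />b</p>"   # self-closing, XML-compatible
html5_style = "<p>a<br>b</p>"   # HTML5-style empty tag

print(parses_as_xml(xml_style), parses_as_xml(html5_style))
print(html_tags(xml_style), html_tags(html5_style))
```

The strict XML parser only accepts the self-closing form, while the HTML parser sees the same tags either way, which is why an XML-based scraper is the one at risk.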

Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;

2016-05-03 Thread Gergo Tisza
On Tue, May 3, 2016 at 4:34 PM, Gergo Tisza  wrote:
>
> There aren't many options other than content-scraping if you want to
> transform Wikipedia articles into some semblance of structured data. We
> even do it ourselves, for media metadata (and use an XML parser for it
>

Actually, the XML parser was replaced with DOMDocument a while ago,
which can handle HTML5 fine. But the point stands: HTML scraping is hardly
an unusual requirement for reusers of our content.

Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;

2016-05-02 Thread Max Semenik
On Mon, May 2, 2016 at 3:04 PM, Brian Wolff  wrote:

>
> There are references to it breaking people's screen scraping bots last time
> it was turned on. That was like 5 years ago though.
>

At this point, I would say that everybody who screen-scrapes saw it coming
and breaking them is a good thing as sometimes, lessons just have to be
learned.


Best regards,
Max Semenik ([[User:MaxSem]])

Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;

2016-05-02 Thread Brian Wolff
>
> The only benefit of $wgWellFormedXml was that you could toss your
> "well-formed" tag soup into an XML parser that didn't grok HTML. I have no
> idea if that worked reliably or was actually useful to anyone, but it's
> probably worth confirming that before actually removing the funky
> self-closing tags.
>

There are references to it breaking people's screen scraping bots last time
it was turned on. That was like 5 years ago though.

--bawolff

Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;

2016-05-02 Thread Brion Vibber
I'd say an HTML5 output mode *ought* to work like this:

*Don't try to be clever.*
* Consistency and predictability are key to both security review and data
consumability.

*Quote attributes consistently and predictably.*
* Always use double-quotes on attributes in output.

*Output specced empty tags in HTML style.*
* , ,  are fine and not ambiguous at all to an HTML parser.
There's no need to go adding a "/" in at the end!
* These are already whitelisted in the Html class so it's easy to not mess
this up.

*Don't do other silly things for old-school XHTML 1.*
* CDATA wrapping of 

[Wikitech-l] Getting rid of $wgWellFormedXml = false;

2016-05-02 Thread Brian Wolff
So currently, we have two ways of outputting HTML. $wgWellFormedXml =
true (the default) outputs HTML that happens to conform to the
rules of XML. $wgWellFormedXml = false, on the other hand, uses the more
lax HTML5 rules to save a few bytes.
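
To make the difference concrete, here is a minimal sketch of what the two modes might produce for the same element. It is written in Python purely for illustration and is not the Html class; `render_element`, `VOID_ELEMENTS` and `SAFE_UNQUOTED` are hypothetical names, and the exact rules (unquoted attribute values, bare boolean attributes, no trailing slash on empty tags) are my assumption about what the lax mode amounts to.

```python
import re
from html import escape

# Illustrative subset of HTML void elements (an assumption for this sketch).
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}

# Values made only of these characters are safe to leave unquoted in HTML5.
SAFE_UNQUOTED = re.compile(r"^[A-Za-z0-9._:-]+$")


def render_element(name, attrs=None, well_formed_xml=True):
    """Serialize an opening (or void) tag in one of the two output modes."""
    parts = [name]
    for key, value in (attrs or {}).items():
        if value is True:
            # Boolean attribute: XML needs a value, HTML5 does not.
            parts.append(f'{key}=""' if well_formed_xml else key)
        elif not well_formed_xml and SAFE_UNQUOTED.match(value):
            # Lax mode: drop the quotes when the value allows it.
            parts.append(f"{key}={value}")
        else:
            parts.append(f'{key}="{escape(value, quote=True)}"')
    body = " ".join(parts)
    if name in VOID_ELEMENTS and well_formed_xml:
        # XML-compatible self-closing form.
        return f"<{body} />"
    return f"<{body}>"


attrs = {"type": "checkbox", "checked": True}
print(render_element("input", attrs, well_formed_xml=True))
print(render_element("input", attrs, well_formed_xml=False))
```

Every branch on `well_formed_xml` in the sketch is a place where the two modes can diverge, which is the kind of extra complexity a single mode would avoid.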

Having two modes of output feels rather silly to me. Originally, I
think this was meant as a feature flag while $wgWellFormedXml=false
stabilized, but it never got turned on, and here we are 7 years later.

Having $wgWellFormedXml=false increases the complexity of the code,
and not all that many people use it (a notable exception is
translatewiki). I think it's important that security-critical code be
as simple as possible. Furthermore, there seems to be very little
benefit to having the second mode (after you account for gzip, saving
a few bytes from writing  instead of  really doesn't
matter, imo).
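
A rough way to check the gzip point, as a sketch only: the page below is synthetic, the `<br />` and `<br>` tags are assumed stand-ins for the two styles, and `build_page` and `bytes_saved` are hypothetical helpers, so real pages will give different numbers.

```python
import gzip


def build_page(empty_tag, lines=200):
    """Build a synthetic page that uses the given empty tag on every line."""
    return "".join(
        f"<p>Line {i} of sample text{empty_tag}</p>\n" for i in range(lines)
    ).encode("utf-8")


def bytes_saved(lines=200):
    """Return (raw, gzipped) bytes saved by the shorter tag style."""
    xml_style = build_page("<br />", lines)
    html_style = build_page("<br>", lines)
    raw = len(xml_style) - len(html_style)
    compressed = len(gzip.compress(xml_style)) - len(gzip.compress(html_style))
    return raw, compressed


raw, compressed = bytes_saved()
print(f"saved before gzip: {raw} bytes, after gzip: {compressed} bytes")
```

Uncompressed, the saving is two bytes per tag; after gzip the repeated tag is mostly absorbed by the compressor, so the difference shrinks.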

With that in mind, I would like to propose killing $wgWellFormedXml =
false. I'm not so much attached to the true mode (although I do feel
the true mode is significantly more sane); I simply want there
to be a single mode. Changing the default to false was vetoed in
T52040, so I think that true would be the best choice going
forward if we are getting rid of one of the modes.

If there are aspects of the other mode that people really want, then I
think we should simply merge them into the default behavior instead
of having two separate modes.

See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
appreciate everyone's feedback.

Thanks,
Brian
