Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;
On Saturday, May 14, 2016, Strainu wrote:
> 2016-05-14 4:07 GMT+03:00 Legoktm:
>> Hi,
>>
>> On 05/02/2016 11:42 AM, Brian Wolff wrote:
>>> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
>>> appreciate everyone's feedback.
>>
>> Given the lack of objections here and on Gerrit, I went ahead and merged
>> it today.
>
> Can you please clarify if this change will have any effect on
> non-valid HTML in the Wikitext? I suppose no change will occur, since
> this was the default anyway, but I'd like a confirmation.
>
> Strainu

That is correct. Nothing will change about invalid HTML: if you have Tidy enabled, the invalid HTML gets fixed; if you don't, it does not.

--
bawolff
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;
On 14/05/2016 at 03:07, Legoktm wrote:
> Hi,
>
> On 05/02/2016 11:42 AM, Brian Wolff wrote:
>> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
>> appreciate everyone's feedback.
>
> Given the lack of objections here and on Gerrit, I went ahead and merged
> it today.

Hello,

That sounds good. I would suggest applying it to REL1_27 as well.

--
Antoine "hashar" Musso
Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;
2016-05-14 4:07 GMT+03:00 Legoktm:
> Hi,
>
> On 05/02/2016 11:42 AM, Brian Wolff wrote:
>> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
>> appreciate everyone's feedback.
>
> Given the lack of objections here and on Gerrit, I went ahead and merged
> it today.

Can you please clarify if this change will have any effect on non-valid HTML in the Wikitext? I suppose no change will occur, since this was the default anyway, but I'd like a confirmation.

Strainu
Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;
Hi,

On 05/02/2016 11:42 AM, Brian Wolff wrote:
> See gerrit patch https://gerrit.wikimedia.org/r/286495 I would
> appreciate everyone's feedback.

Given the lack of objections here and on Gerrit, I went ahead and merged it today.

--
Legoktm
Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;
On Monday, May 2, 2016, Max Semenik wrote:
> On Mon, May 2, 2016 at 3:04 PM, Brian Wolff wrote:
>
> At this point, I would say that everybody who screen-scrapes saw it coming
> and breaking them is a good thing as sometimes, lessons just have to be
> learned.

Personally, I don't think we should shy away from breaking screen scrapers if we get something out of it, but in this case I don't see the benefit. Breaking things because we can, without getting any benefit (or only trivial benefits), seems rather pointless and kind of mean to those who do scrape.

--
bawolff
Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;
On Tue, May 3, 2016 at 2:43 AM, Max Semenik wrote:
> At this point, I would say that everybody who screen-scrapes saw it coming
> and breaking them is a good thing as sometimes, lessons just have to be
> learned.

There aren't many options other than content-scraping if you want to transform Wikipedia articles into some semblance of structured data. We even do it ourselves, for media metadata (and use an XML parser for it, as PHP doesn't offer much in the way of parsing HTML5, so outputting HTML5-style empty tags might break it - although IIRC there is a hack to work around that as file pages can contain ill-formed HTML anyway).
Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;
On Tue, May 3, 2016 at 4:34 PM, Gergo Tisza wrote:
> There aren't many options other than content-scraping if you want to
> transform Wikipedia articles into some semblance of structured data. We
> even do it ourselves, for media metadata (and use an XML parser for it

Actually, the XML parser was replaced with DOMDocument a while ago, which can handle HTML5 fine. But the point stands: HTML scraping is hardly an unusual requirement for reusers of our content.
Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;
On Mon, May 2, 2016 at 3:04 PM, Brian Wolff wrote:
> There are references to it breaking people's screen scraping bots last time
> it was turned on. That was like 5 years ago though.

At this point, I would say that everybody who screen-scrapes saw it coming and breaking them is a good thing as sometimes, lessons just have to be learned.

Best regards,
Max Semenik ([[User:MaxSem]])
Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;
> The only benefit of $wgWellFormedXml was that you could toss your
> "well-formed" tag soup into an XML parser that didn't grok HTML. I have no
> idea if that worked reliably or was actually useful to anyone, but it's
> probably worth confirming that before actually removing the funky
> self-closing tags.

There are references to it breaking people's screen scraping bots last time it was turned on. That was like 5 years ago though.

--bawolff
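The point about XML parsers that don't grok HTML can be illustrated with a small Python sketch. This is only an illustration built on Python's standard library, not MediaWiki code: the sample markup and the helper names (parses_as_xml, TagCollector, tags_seen_by_html_parser) are made up for the example, and the br tag is just an assumed stand-in for any empty element.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample output: the same paragraph with an XML-style
# self-closing tag and with an HTML-style empty tag.
XML_STYLE = "<p>first line<br />second line</p>"
HTML_STYLE = "<p>first line<br>second line</p>"


def parses_as_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Collects the names of all start tags an HTML parser reports."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_seen_by_html_parser(markup: str) -> list[str]:
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    for label, markup in (("XML style", XML_STYLE), ("HTML style", HTML_STYLE)):
        print(label, "| accepted by XML parser:", parses_as_xml(markup),
              "| tags seen by HTML parser:", tags_seen_by_html_parser(markup))
```

A strict XML parser rejects the HTML-style variant because the br element is never closed, while an HTML parser reports the same tags for both variants. That is the kind of scraper that would notice a switch away from self-closing tags.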
Re: [Wikitech-l] Getting rid of $wgWellFormedXml = false;
I'd say an HTML5 output mode *ought* to work like this:

*Don't try to be clever.*
* Consistency and predictability are key to both security review and data consumability.

*Quote attributes consistently and predictably.*
* Always use double-quotes on attributes in output.

*Output specced empty tags in HTML style.*
* Such tags are fine and not ambiguous at all to an HTML parser. There's no need to go adding a "/" in at the end!
* These are already whitelisted in the Html class so it's easy to not mess this up.

*Don't do other silly things for old-school XHTML 1.*
* CDATA wrapping of
[Wikitech-l] Getting rid of $wgWellFormedXml = false;
So currently, we have two ways of outputting HTML: $wgWellFormedXml = true (the default) outputs HTML that happens to conform with the rules of XML. $wgWellFormedXml = false, on the other hand, uses more lax HTML5 rules to save a few bytes.

Having two modes of output feels rather silly to me. Originally I think this was meant as a feature flag while $wgWellFormedXml=false stabilized, but it never got turned on, and here we are 7 years later.

Having $wgWellFormedXml=false increases the complexity of the code, and not all that many people use it (a notable exception is translatewiki). I think it's important that security-critical code be as simple as possible. Furthermore, there seems to be very little benefit to having the second mode (after you account for gzip, saving a few bytes by writing HTML5-style tags instead of XML-style ones really doesn't matter, imo).

With that in mind, I would like to propose killing $wgWellFormedXml = false;

I'm not so much attached to the true mode (although I do feel the true mode is significantly more sane), as I just simply want there to be a single mode. Putting the default to false was vetoed in T52040, so I think that true would be the best choice going forward if we are getting rid of one of the modes. If there are aspects of the other mode that people really want, then I think we should simply merge that into the default behavior instead of having two separate modes.

See gerrit patch https://gerrit.wikimedia.org/r/286495

I would appreciate everyone's feedback.

Thanks,
Brian
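The remark about gzip can be checked roughly with a few lines of Python. This is a sketch under loud assumptions, not a measurement of real MediaWiki output: the page text is made up, render_page and sizes are hypothetical helpers, the br tag stands in for whichever empty tags differ between the two modes, and zlib (the same DEFLATE compression gzip uses) stands in for the web server's gzip.

```python
import zlib


def render_page(well_formed_xml: bool, lines: int = 200) -> bytes:
    """Build a made-up page body; only the line-break tag differs by mode."""
    # Assumption: the true mode writes XML-style "<br />",
    # the false mode writes the shorter HTML-style "<br>".
    line_break = "<br />" if well_formed_xml else "<br>"
    body = "".join(
        f"Line {i} of some article text{line_break}\n" for i in range(lines)
    )
    return body.encode("utf-8")


def sizes(well_formed_xml: bool) -> tuple[int, int]:
    """Return (raw size, compressed size) in bytes for one output mode."""
    raw = render_page(well_formed_xml)
    return len(raw), len(zlib.compress(raw))


if __name__ == "__main__":
    xml_raw, xml_gz = sizes(True)
    html_raw, html_gz = sizes(False)
    print("raw bytes saved by the lax mode:       ", xml_raw - html_raw)
    print("compressed bytes saved by the lax mode:", xml_gz - html_gz)
```

On this toy input the raw saving is two bytes per tag, while the compressed saving is much smaller, because the repeated tag text compresses well in either form. Real pages will differ, so treat the numbers as illustrative only.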