On 12.07.2010 16:43, Mike Wilcox wrote:
On Jul 12, 2010, at 8:39 AM, Nils Dagsson Moskopp wrote:

That's a little different. Google purposely uses non-standard,
invalid HTML in ways that still render in a browser, in order to
make it more difficult for screen scrapers. They also "break it" in a
different way every week.

Assuming this is true (which I find difficult to believe), wouldn't a
screen scraper based on the HTML5 parsing algorithm defeat this
purpose?

Honestly, I don't know. But the W3C validator defaulted to checking it as HTML5:
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.com%2Fsearch%3Fsource%3Dig%26hl%3Den%26rlz%3D%26%3D%26q%3Dhtml5%26aq%3Df%26aqi%3D%26aql%3D%26oq%3D%26gs_rfai%3D&charset=%28detect+automatically%29&doctype=Inline&group=0

True, but a parser conforming to the spec (*) would handle those errors, so in this case obfuscation wouldn't work. Essentially, any code using that parser would see the same information as an off-the-shelf web browser.
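
To illustrate (a minimal sketch, not from the original mail, assuming Python's html5lib package, one implementation of the HTML5 parsing algorithm; the markup string is made up):

    # A minimal sketch using html5lib, a Python implementation of
    # the HTML5 parsing algorithm.
    import html5lib

    # Deliberately broken markup of the kind that fails validation
    # but still renders in every browser (made-up example).
    tag_soup = "<p><b>unclosed tags<p>more text</i><td>stray cell"

    # html5lib applies the spec's error-recovery rules, so this
    # builds the same tree a browser would.
    doc = html5lib.parse(tag_soup, namespaceHTMLElements=False)

    # Walk the recovered tree: the scraper sees clean structure,
    # not the original tag soup.
    for element in doc.iter():
        print(element.tag, (element.text or "").strip())

The point being: whatever recovery a browser performs, such a scraper performs too, so there is nothing left to hide behind.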

...
Besides protecting their API, Google also will scratch and claw
to save every byte. They are the gold standard of a high-performance
website. While this may or may not explain the things that don't
validate, what it does say is that nothing coming from google.com
is accidental.
...

Understood. There's an ongoing controversy about whether it makes sense to make things like these invalid (just stating, not offering an opinion).

I believe some time ago a certain Google employee actually *did* state that some of the conformance problems were unintentional. (Yes, I spent a few minutes trying to find that statement, but wasn't successful.)

Best regards, Julian

(*) That is, one implementing error recovery, which IMHO isn't required.
