On 12.07.2010 16:43, Mike Wilcox wrote:
On Jul 12, 2010, at 8:39 AM, Nils Dagsson Moskopp wrote:

That's a little different. Google purposely uses non-standard,
invalid HTML in ways that still render in a browser, in order to
make it more difficult for screen scrapers. They also "break it" in a
different way every week.

Assuming this is true (which I find difficult to believe), wouldn't a
screen scraper based on the HTML5 parsing algorithm defeat this
purpose?

Honestly, I don't know. But the W3C validator defaulted to checking it as HTML5:
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.com%2Fsearch%3Fsource%3Dig%26hl%3Den%26rlz%3D%26%3D%26q%3Dhtml5%26aq%3Df%26aqi%3D%26aql%3D%26oq%3D%26gs_rfai%3D&charset=%28detect+automatically%29&doctype=Inline&group=0

True, but a parser conforming to the spec (*) would handle those errors, so in this case obfuscation wouldn't work. Essentially, any code using that parser would see the same information as an off-the-shelf web browser.
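
To illustrate (a minimal sketch, not from the original mail, assuming Python's html5lib package, one implementation of the HTML5 parsing algorithm; the markup string is made up):

    # A minimal sketch using html5lib, a Python implementation of
    # the HTML5 parsing algorithm.
    import html5lib

    # Deliberately broken markup of the kind that fails validation
    # but still renders in every browser (made-up example).
    tag_soup = "<p><b>unclosed tags<p>more text</i><td>stray cell"

    # html5lib applies the spec's error-recovery rules, so this
    # builds the same tree a browser would.
    doc = html5lib.parse(tag_soup, namespaceHTMLElements=False)

    # Walk the recovered tree: the scraper sees clean structure,
    # not the original tag soup.
    for element in doc.iter():
        print(element.tag, (element.text or "").strip())

The point being: whatever recovery a browser performs, such a scraper performs too, so there is nothing left to hide behind.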

...
Besides protecting their API, Google also will scratch and claw
to save every byte. They are the gold standard of a high-performance
website. While this may or may not explain the things that don't
validate, what it does say is that nothing coming from google.com
is accidental.
...

Understood. There's an ongoing controversy about whether it makes sense to make things like these invalid (just stating, not offering an opinion).

I believe some time ago a certain Google employee actually *did* state that some of the conformance problems were unintentional. (Yes, I spent a few minutes trying to find that statement, but wasn't successful.)

Best regards, Julian

(*) That is, one implementing error recovery, which IMHO isn't required.
