Hi,

From: "Serban Ghita" <[EMAIL PROTECTED]>
I have a web crawler that I am using for personal research. It crawls the entire site, finding all the links and creating a sitemap, and grabs some statistics. After a while I felt that I could do more than that, so I decided to make it parse the HTML code and extract some statistics about tags.

You may be interested in Google's Web Authoring Statistics[1].

For the moment I have created an array with all HTML tags (deprecated ones too), grouped by their structure type (block, inline, single; that's what I call them). I am parsing the HTML code using regular expressions, but when I searched the net, I saw lots of people saying: don't parse HTML using regex.

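You can't reliably parse HTML with regular expressions, because HTML's parsing rules are more complicated than a pattern match can capture: whether a "<" starts a tag depends on context such as comments, script content and quoted attribute values.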
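
To illustrate the point, here is a minimal sketch in Python that counts start tags two ways: with a naive regular expression and with the standard library's html.parser module. The sample markup, the pattern and the names count_tags_regex, TagCounter and count_tags_parser are made up for this example, and html.parser is a lenient parser rather than an implementation of the browser algorithm, so treat this as a demonstration of the problem and not as a recommended crawler design.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical sample: the <div> appears only inside a comment,
# so it is not an element of the document.
SAMPLE = '<p>text</p><!-- <div>not an element</div> --><br>'


def count_tags_regex(html):
    """Count start tags with a naive pattern that knows nothing about comments."""
    names = re.findall(r'<([a-zA-Z][a-zA-Z0-9]*)[^>]*>', html)
    return Counter(name.lower() for name in names)


class TagCounter(HTMLParser):
    """Count start tags as reported by the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser passes the tag name already lowercased.
        self.counts[tag] += 1


def count_tags_parser(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


if __name__ == '__main__':
    print('regex :', dict(count_tags_regex(SAMPLE)))
    print('parser:', dict(count_tags_parser(SAMPLE)))
```

The regular expression reports a div that exists only inside the comment, while the parser skips the comment and reports just p and br. Script content and malformed markup cause similar miscounts.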

I studied a bit more, and then found the relation between the HTML document and the DTD (Document Type Definition) declaration. I noticed that browsers rely on it (the public ones are cached, and the custom ones are fetched before the HTML document is parsed).

Actually, browsers don't parse DTDs at all for HTML.
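
As a rough illustration, a parser can treat the doctype as just another token in the input and never fetch the DTD it names. The sketch below uses Python's html.parser, which hands the declaration text to handle_decl and does nothing else with it. The names DoctypeReader, read_doctype and PAGE are invented for this example, and html.parser is not a browser, so this only shows that reading the markup does not require the DTD.

```python
from html.parser import HTMLParser


class DoctypeReader(HTMLParser):
    """Remember the first doctype declaration seen, as plain text."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">"; no DTD is downloaded.
        if self.doctype is None:
            self.doctype = decl


def read_doctype(html):
    reader = DoctypeReader()
    reader.feed(html)
    reader.close()
    return reader.doctype


# Hypothetical page using the HTML 4.01 Strict doctype.
PAGE = ('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd"><title>t</title><p>x')

if __name__ == '__main__':
    print(read_doctype(PAGE))
```

Browsers do look at the doctype, but only to choose between quirks and standards rendering modes, not to learn which tags and attributes exist.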

Can you point me to some documentation that explains how a browser parses HTML documents, or how it uses the DTD to interpret the tags and their attributes?

It is specified in the Parsing section[2] of Web Applications 1.0.

[1] http://code.google.com/webstats/index.html
[2] http://whatwg.org/specs/web-apps/current-work/#parsing

Regards,
Simon Pieters

