Re: [whatwg] hello list

Simon Pieters Sun, 16 Apr 2006 13:58:32 -0700

Hi,

From: "Serban Ghita" <[EMAIL PROTECTED]>

I have a web crawler that I am using for personal research. It crawls the entire site, finding all the links and creating a sitemap, and grabs some statistics. After a while I felt that I could do more than that, so I have decided to make it parse HTML code and extract some statistics about tags.


You may be interested in Google's Web Authoring Statistics[1].

For the moment I have created an array with all HTML tags (deprecated ones too), grouped by their structure type (block, inline, single; that's what I call them). I am parsing the HTML code using regular expressions, but as I searched the net, I saw lots of people saying: don't parse HTML using regex.

You can't reliably parse HTML with regular expressions, because HTML's parsing rules are more complicated than a regular expression can express.
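
To make that concrete, here is a minimal sketch in Python comparing a naive tag-counting regexp with a tokenizing parser. The sample markup, the regexp, and the names `count_tags_regex`, `TagCounter` and `count_tags_parser` are all made up for illustration. The standard library's `html.parser` is only a tolerant tokenizer, not an implementation of the parsing rules a browser follows, so treat this as a hint at the problem rather than a reference.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical sample: a ">" inside a quoted attribute value, and a
# comment that contains something that merely looks like a tag.
SAMPLE = '<p title="a > b">text<!-- <b>not a tag</b> --><br></p>'

# Naive pattern (an assumption about what a regexp-based crawler might use).
NAIVE_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def count_tags_regex(html):
    """Count start tags with the naive regexp."""
    return Counter(name.lower() for name in NAIVE_TAG.findall(html))


class TagCounter(HTMLParser):
    """Count start tags reported by the standard-library tokenizer."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Comments are delivered to handle_comment instead, so markup
        # inside them never reaches this method.
        self.counts[tag] += 1


def count_tags_parser(html):
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


if __name__ == "__main__":
    print("regexp:", dict(count_tags_regex(SAMPLE)))
    print("parser:", dict(count_tags_parser(SAMPLE)))
```

On this sample the regexp reports a `b` element that exists only inside the comment, while the parser does not. Real pages add unquoted attributes, script and style content, and misnested or unclosed tags on top of that, which is where per-tag statistics from a regexp tend to drift.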

I studied a bit more, then I found the relation between the HTML document and the DTD (Document Type Definition) declaration. I've noticed that browsers rely on it (the ones that are public are cached, and the custom ones are grabbed before the HTML document is parsed).


Actually, browsers don't parse DTDs at all for HTML.

Can you point me to some documentation that explains how a browser parses HTML documents, or how it uses the DTD to interpret the tags and their attributes?


It is specified in the Parsing section[2] of Web Applications 1.0.

[1] http://code.google.com/webstats/index.html
[2] http://whatwg.org/specs/web-apps/current-work/#parsing

Regards,
Simon Pieters
