Hello guys,
I am so happy to find a list interested in the future of
the Web (HTML/CSS/W3C standards).
Until I get a feeling for what's happening here, I
will only read and learn from your messages. But I have one problem
that I am sure you know how to handle (I hope this is not off-topic
here).
I have a web crawler that I am using for personal
research. It crawls an entire site, finds all the links, creates a
sitemap, and gathers some statistics. After a while I felt that I could do more than
that, so I decided to make it parse the HTML code and extract some statistics
about tags. For the moment I have created an array of all HTML tags
(deprecated ones too), grouped by their structure type (block, inline, single;
that's what I call them). I am parsing the HTML code using regular expressions,
but when I searched the net I saw lots of people saying: don't parse HTML using
regex.
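To make the tag-statistics idea concrete, here is a minimal sketch that counts tags without regular expressions. It assumes Python and its standard `html.parser` module, since the crawler's actual language is not stated. The `TAG_GROUPS` table, `group_of`, `TagStats` and `tag_stats` are hypothetical names for illustration, and the grouping is deliberately incomplete.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete grouping that mirrors the
# "block, inline, single" categories described above.
TAG_GROUPS = {
    "block": {"div", "p", "ul", "li", "table", "h1"},
    "inline": {"a", "span", "b", "i", "em", "font"},
    "single": {"br", "img", "hr", "input", "meta", "link"},
}


def group_of(tag):
    """Return the group name for a tag, or "other" if it is not listed."""
    for group, tags in TAG_GROUPS.items():
        if tag in tags:
            return group
    return "other"


class TagStats(HTMLParser):
    """Counts every start tag seen, per tag name and per group."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.groups = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag names already lowercased.
        self.tags[tag] += 1
        self.groups[group_of(tag)] += 1


def tag_stats(html):
    """Parse an HTML string and return (per-tag counts, per-group counts)."""
    parser = TagStats()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.groups


if __name__ == "__main__":
    tags, groups = tag_stats('<div><p>Hi <a href="#">x</a><br></p></div>')
    print(dict(tags))
    print(dict(groups))
```

The parser does the tokenizing, so the statistics code only reacts to events; a hand-written tokenizer could later replace `HTMLParser` behind the same `tag_stats` function.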
I studied a bit more and found the relation
between the HTML document and the DTD (Document Type Definition)
declaration. I've noticed that browsers rely on it (the public ones
are cached, and the custom ones are fetched before the HTML document is
parsed).
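As a small illustration of where that declaration sits in a document, the sketch below only reads the DOCTYPE text from a page. It assumes Python's standard `html.parser` module; `DoctypeFinder` and `find_doctype` are hypothetical names, and the code makes no claim about what a browser does with the declaration afterwards.

```python
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Remembers the first DOCTYPE declaration found in a document."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. "DOCTYPE html".
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def find_doctype(html):
    """Return the DOCTYPE declaration text, or None if there is none."""
    finder = DoctypeFinder()
    finder.feed(html)
    finder.close()
    return finder.doctype


if __name__ == "__main__":
    page = (
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">'
        "<html><body><p>Hello</p></body></html>"
    )
    print(find_doctype(page))
```

A crawler could record this string per page to get statistics on which document types a site declares.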
Can you point me to some documentation that
explains the way a browser parses HTML documents, or the way it uses the
DTD for interpreting the tags and their attributes?
Another thing is that everyone recommended
using an already built library, but I want to slowly learn the whole parsing
process by myself, so that I can understand all the principles.
Thanks a lot!
Best wishes,
--------------------------------------
Serban Gh. Ghita
Project Manager
VERASYS Intl. Web Dept.
Bucuresti, ROMANIA
Tel: +40-21-201.67.62
Fax: +40-251-306.017
GSM: +40-788-28.29.10
email: [EMAIL PROTECTED]
email: [EMAIL PROTECTED]
www.verasys.com / www.itpromo.ro
