Hi all,

I'm looking at some pretty complex regexps at the moment for parsing HTML:
stripping out some attributes, getting the values of others, and so on.

The simple fact is that all these...

<A HREF=foo.php>
<A HREF='foo.php'>
<A HREF="foo.php">
<A HREF='foo.php' TARGET="something">
<A HREF=foo.php TARGET=something>
<A HREF = "foo.php">
<A HREF = 'foo.php'>

... and many more are *valid* HTML markup.  That makes the task look
rather daunting, especially if I want to get it right rather than take
shortcuts.


I'm pretty sure I need to start looking at a state engine or HTML parser
which can identify a tag and seek out all of its attributes, according
to strictly valid HTML (preferably 4.01 Strict), rather than working with
regexps.
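
To make the parser idea concrete, here is a rough sketch of what I imagine
it could look like. It assumes PHP's DOM extension (DOMDocument) is
available, and the function name extract_anchor_attributes() is made up
for illustration. It leans on libxml's lenient HTML parser rather than
validating against 4.01 Strict, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: collect the attributes of every <a> element
// found in an HTML fragment, whatever quoting style the source used.
function extract_anchor_attributes(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings
    // for sloppy-but-common markup, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $anchors = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $attrs = [];
        foreach ($anchor->attributes as $attr) {
            // Normalise names so HREF and href end up under the same key.
            $attrs[strtolower($attr->name)] = $attr->value;
        }
        $anchors[] = $attrs;
    }
    return $anchors;
}

// Usage: each <a> yields one array of attribute name => value pairs.
print_r(extract_anchor_attributes(
    '<A HREF = "foo.php">one</A> <A HREF=bar.php TARGET=something>two</A>'
));
```

The appeal is that the quoting and whitespace variations are the parser's
problem rather than mine; the cost is that it accepts far more than
strictly valid HTML.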

As it turns out, I'm only looking to work with <A> at this stage, so I don't
think it'll be a massive state engine, but I'd like to explore any other
options or existing libraries of code before considering such a beast.
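
For comparison, this is roughly the size of hand-rolled state engine I
have in mind for a single start tag. It is only a sketch: the function
name scan_tag_attributes() is hypothetical, it expects one tag such as
"<A HREF = 'foo.php'>" as input, and it does not check the result against
any DTD or decode character entities.

```php
<?php
// Hypothetical sketch: walk one start tag character by character and
// return its attributes as a lowercase name => value array.
function scan_tag_attributes(string $tag): array
{
    $attrs = [];
    $len = strlen($tag);
    $i = 0;

    // Skip the opening "<" and the tag name itself.
    if ($i < $len && $tag[$i] === '<') { $i++; }
    while ($i < $len && !ctype_space($tag[$i]) && $tag[$i] !== '>') { $i++; }

    while ($i < $len) {
        // State: between attributes. Skip whitespace, stop at ">" or "/".
        while ($i < $len && ctype_space($tag[$i])) { $i++; }
        if ($i >= $len || $tag[$i] === '>' || $tag[$i] === '/') { break; }

        // State: attribute name. Read up to whitespace, "=" or ">".
        $start = $i;
        while ($i < $len && !ctype_space($tag[$i])
               && $tag[$i] !== '=' && $tag[$i] !== '>') { $i++; }
        $name = strtolower(substr($tag, $start, $i - $start));

        // State: after name. Optional whitespace, then possibly "=".
        while ($i < $len && ctype_space($tag[$i])) { $i++; }
        if ($i >= $len || $tag[$i] !== '=') {
            $attrs[$name] = '';          // attribute given without a value
            continue;
        }
        $i++;                            // consume "="
        while ($i < $len && ctype_space($tag[$i])) { $i++; }

        // State: attribute value, either quoted or unquoted.
        if ($i < $len && ($tag[$i] === '"' || $tag[$i] === "'")) {
            $quote = $tag[$i++];
            $start = $i;
            while ($i < $len && $tag[$i] !== $quote) { $i++; }
            $attrs[$name] = substr($tag, $start, $i - $start);
            $i++;                        // consume the closing quote
        } else {
            $start = $i;
            while ($i < $len && !ctype_space($tag[$i]) && $tag[$i] !== '>') { $i++; }
            $attrs[$name] = substr($tag, $start, $i - $start);
        }
    }
    return $attrs;
}

// Usage: mixed quoting and spacing around "=" in one tag.
print_r(scan_tag_attributes("<A HREF = 'foo.php' TARGET=\"something\">"));
```

Even this small version needs four states, which is why I'd rather find an
existing library before going down this road.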


Any links / articles / code / whatever warmly welcome :)


Justin French


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
