Brian Kim wrote:
For example, <a href="aaa0", alt="aaa1"><em>test1</em> <em>test2</em> I am a boy</a>
A comma in the attribute list is a syntax error.
Then, I want to analyze those nodes as follows. The tag of node 1 is "a". Its attributes are href and alt, which have the values "aaa0" and "aaa1" respectively. It also has the anchor text "I am a boy". The other two tags are "em", with "test1" and "test2" as their text. This level of detail is enough for me. Can anybody help me? In fact, I have written some sample code using XPath.
XPath and XSLT are very good high-level tools to achieve the analysis you want. You could also do this using DOM, but this would be more cumbersome.
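To make the analysis concrete, here is a minimal sketch in Python. It is not your code and not LibXML2: it assumes the corrected sample (comma removed) and uses the standard library's ElementTree, which supports only a small subset of XPath. The names SAMPLE and describe are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The sample from the question, with the comma in the attribute list removed.
SAMPLE = '<a href="aaa0" alt="aaa1"><em>test1</em> <em>test2</em> I am a boy</a>'


def describe(element):
    """Return the tag, attributes and contained text of one element."""
    return {
        "tag": element.tag,
        "attributes": dict(element.attrib),
        # itertext() walks the element's own text and that of its descendants.
        "text": "".join(element.itertext()).strip(),
    }


root = ET.fromstring(SAMPLE)

# ".//em" is one of the XPath-like expressions ElementTree understands.
nodes = [describe(root)] + [describe(em) for em in root.findall(".//em")]

# Text that follows a child's end tag is stored as that child's "tail",
# so "I am a boy" hangs off the second <em>, not directly off <a>.
trailing = [em.tail.strip() for em in root.findall("em") if em.tail and em.tail.strip()]

for node in nodes:
    print(node)
print(trailing)
```

Note that the text of "a" as collected here includes the text of its "em" children. If you only want "I am a boy", look at the text nodes that are direct children of "a" (in ElementTree terms, the tail of the last "em").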
For the simple HTML input, my code produced an almost correct parsing result, but when I tried to parse HTML fetched from a URL, which is of course more complex than the simple sample, I got weird data.
As pointed out, your simple sample input has a syntax error. Random HTML from the web may well have syntax errors, too.
Can I say that if an HTML document is not well-formed, the association between a tag and its anchor text is sometimes not handled properly?
Well-formedness applies to XML, not to HTML. Note that from this vantage point, XHTML is XML, not HTML. HTML may be malformed, too, as in your simple sample above.
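A quick way to see the difference is to hand the markup to a strict XML parser. The following is only a sketch using Python's standard library; the helper name is_well_formed is made up for illustration. The parser rejects your sample as written and accepts it once the comma is gone.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


with_comma = '<a href="aaa0", alt="aaa1"><em>test1</em> <em>test2</em> I am a boy</a>'
without_comma = '<a href="aaa0" alt="aaa1"><em>test1</em> <em>test2</em> I am a boy</a>'

print(is_well_formed(with_comma))
print(is_well_formed(without_comma))
```

This tells you only whether the input is well-formed XML. It says nothing about whether it is valid HTML.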
In other words, is it possible that the parse tree is not entirely correct if the HTML is not well-formed?
Definitely yes.
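Here is a small illustration of how that can happen, again only a sketch with Python's standard html.parser, not what LibXML2 does internally. The class OpenTagTracker and its recovery rule are assumptions made up for this example. It attaches each piece of text to the innermost element still open. With one missing end tag, "I am a boy" ends up associated with "em" instead of "a".

```python
from html.parser import HTMLParser


class OpenTagTracker(HTMLParser):
    """Attach each text chunk to the innermost element that is still open."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.pairs = []   # (owning tag, text) in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Naive recovery: close everything up to the nearest matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            owner = self.stack[-1] if self.stack else None
            self.pairs.append((owner, text))


def associate(html):
    tracker = OpenTagTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.pairs


# Properly nested input.
good = associate('<a href="aaa0"><em>test1</em> <em>test2</em> I am a boy</a>')
# Same input with the first </em> missing.
bad = associate('<a href="aaa0"><em>test1 <em>test2</em> I am a boy</a>')

print(good)
print(bad)
```

A real HTML parser applies more elaborate recovery rules than this, but the principle is the same: for broken input, the parser has to guess, and the guess may not match what the author meant.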
In fact, I want to double-check if my way is right or not, seeing some general way of looking at html-parsed tree nodes that somebody may suggest.
The HTML parser provided by LibXML2 is good. Other useful tools include TagSoup [1] and Tidy [2].

Michael Ludwig

[1] http://home.ccil.org/~cowan/XML/tagsoup/
[2] http://tidy.sourceforge.net/
