Re: [xml] Approach for parsing HTML file or URL

Brian Kim Tue, 04 Aug 2009 08:42:01 -0700

Hi. Thanks.

For example, <a href="aaa0", alt="aaa1"><em>test1</em> <em>test2</em>
I am a boy</a>


Here we have three nodes,
1. <a href="aaa0", alt="aaa1">I am a boy</a>
2. <em>test1</em>
3. <em>test2</em>

Then I want to analyze those nodes as follows.
The tag of node 1 is "a". Its attributes are href and alt, with the
values "aaa0" and "aaa1" respectively.
It also has the anchor text "I am a boy".
The other two nodes have the tag "em", with "test1" and "test2" as their text.
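
A minimal sketch of the per-node report described above. It is not libxml2: it uses Python's standard html.parser as a stand-in, and Node, TreeBuilder, describe and parse are hypothetical names made up for this illustration. The sample drops the comma between the two attributes, because HTML attributes are separated by whitespace only. The sketch assumes every non-void start tag has a matching end tag.

```python
from html.parser import HTMLParser

# Elements that never have an end tag (a small subset, enough for this sketch).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One element of the parsed tree (hypothetical helper, not a libxml2 type)."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = []       # text chunks that are direct children of this element
        self.children = []   # child elements, in document order


class TreeBuilder(HTMLParser):
    """Builds a small element tree from well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]   # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Only close the innermost element; no recovery for mismatched tags.
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Attach non-blank text to the innermost open element.
        if data.strip():
            self.stack[-1].text.append(data.strip())


def describe(node, out=None):
    """Depth-first walk returning (tag, attributes, direct text) per element."""
    if out is None:
        out = []
    for child in node.children:
        out.append((child.tag, child.attrs, " ".join(child.text)))
        describe(child, out)
    return out


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return describe(builder.root)


if __name__ == "__main__":
    sample = '<a href="aaa0" alt="aaa1"><em>test1</em> <em>test2</em>\nI am a boy</a>'
    for tag, attrs, text in parse(sample):
        print(tag, attrs, repr(text))
```

For the sample, this reports "a" with href and alt and the direct text "I am a boy", followed by the two "em" elements with "test1" and "test2". The key point is that text is attached to the innermost element that is open when the text is seen, not to every ancestor.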

This level of detail is enough for me.
Can anybody help me?

In fact, I have written some sample code based on an XPath example. For
simple HTML input, my code produced an almost correct parse result, but
when I tried to parse HTML from a URL, which is of course more complex
than a simple HTML fragment, I got strange data.
In the above example, "I am a boy" is obviously the anchor text of the
"a" tag, and with this simple HTML that is what I get. However, when the
same fragment is part of a complex HTML page, "I am a boy" is reported
as the text of "em".
Is it fair to say that if the HTML is not well-formed, the association
between a tag and its text is sometimes not handled properly?
In other words, is it possible that the parse tree is not perfectly
correct when the HTML is not well-formed?
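
One way this can happen, sketched under assumptions: if an end tag such as </em> is missing, a recovering parser has to guess where the element ends, and any text that follows stays inside the still-open element. The snippet below uses Python's standard html.parser with a hand-written recovery rule (the TextOwner class and text_owners function are hypothetical); libxml2's own recovery rules may differ in detail.

```python
from html.parser import HTMLParser


class TextOwner(HTMLParser):
    """Records which open element each text chunk lands in."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # names of currently open elements, innermost last
        self.owners = []      # (innermost open tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Recovery rule (an assumption of this sketch): close everything up to
        # and including the nearest open element with this name.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip() and self.open_tags:
            self.owners.append((self.open_tags[-1], data.strip()))


def text_owners(html):
    parser = TextOwner()
    parser.feed(html)
    parser.close()
    return parser.owners


well_formed = '<a href="aaa0"><em>test2</em> I am a boy</a>'
broken = '<a href="aaa0"><em>test2 I am a boy</a>'   # </em> is missing

if __name__ == "__main__":
    print(text_owners(well_formed))
    print(text_owners(broken))
```

With the well-formed input, "I am a boy" belongs to "a". With the missing </em>, the same words end up inside "em", which matches the symptom described above.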

I would also like to double-check whether my approach is right, by
comparing it with any general way of inspecting the nodes of a parsed
HTML tree that somebody can suggest.

Thanks

Date: Tue, 04 Aug 2009 08:51:42 +0200
From: Michael Ludwig <[email protected]>
Subject: Re: [xml] Approach for parsing HTML file or URL
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Brian Kim wrote:

> I would like to parse html and see the content of html attributes in
> each tag.

> Using the htmlReadFile function is the obvious choice, but I guess there
> is another way to see each node of the parsed tree instead of using XPath.

Could you define what you mean by "seeing each node of the parsed tree"?

Michael Ludwig
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
