Parsing HTML entities

Tobia Conforto Fri, 31 Aug 2007 06:25:02 -0700

Hello

I have a data source that feeds SAX text nodes into my pipeline, and
these nodes contain escaped HTML entities and <br> tags.  In Java syntax:


"Lorem ipsum &mdash; dolor sit amet. <br> Consectetuer"

or, in XML syntax:

Lorem ipsum &amp;mdash; dolor sit amet. &lt;br&gt; Consectetuer

As you can see, the entities and <br> tags are escaped and part of the
text node.

I cannot change this data source component, therefore I need a
transformer to examine every text node in the stream, split it at the
fake "<br>" tags, substitute them with <xhtml:br/> elements, and
replace every escaped entity with the relevant Unicode character.
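
To make the splitting step concrete, here is a minimal sketch of what such
a transformer might do with each text node. The class name BrSplitter and
the method emit are hypothetical, not part of Cocoon. It assumes that the
fake tag always appears as the exact lowercase string "<br>" and that the
xhtml prefix is declared elsewhere in the pipeline. Entity replacement is
left out of this sketch.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: splits one text node at literal "<br>" markers.
public class BrSplitter {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String BR = "<br>"; // assumption: exact, lowercase form only

    /** Forwards text to the handler, emitting an xhtml:br element for each "<br>". */
    public static void emit(String text, ContentHandler out) throws SAXException {
        int start = 0;
        int hit;
        while ((hit = text.indexOf(BR, start)) >= 0) {
            sendText(text, start, hit, out);
            out.startElement(XHTML_NS, "br", "xhtml:br", new AttributesImpl());
            out.endElement(XHTML_NS, "br", "xhtml:br");
            start = hit + BR.length();
        }
        sendText(text, start, text.length(), out);
    }

    // Sends a non-empty slice of the text as a characters() event.
    private static void sendText(String text, int from, int to, ContentHandler out)
            throws SAXException {
        if (to > from) {
            char[] chars = text.substring(from, to).toCharArray();
            out.characters(chars, 0, chars.length);
        }
    }
}
```

In a real transformer this would presumably be called from characters(),
with some care for text nodes that the parser delivers in several chunks.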

I tried doing it with the Parser transformer, but it was too slow.

I tried using the HTML transformer, but I couldn't get it to work.


My question is: what do you suggest I use on the Java side?

Is there anything like PHP's html_entity_decode() available somewhere
in a library that Cocoon is already using, that can parse and convert
HTML 4.0 entities with a single pass on the string?
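
For illustration, this is roughly the behaviour I am after, written as a
self-contained sketch. EntityDecoder is a hypothetical name, and the table
holds only a handful of named entities rather than the full HTML 4.0 set.
It walks the string once, looks ahead a bounded number of characters after
each '&', and leaves unknown or malformed references untouched.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-pass decoder; the entity table is deliberately partial.
public class EntityDecoder {
    private static final int MAX_ENTITY_LENGTH = 12; // bound on the look-ahead
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", 0x26);
        NAMED.put("lt", 0x3C);
        NAMED.put("gt", 0x3E);
        NAMED.put("quot", 0x22);
        NAMED.put("nbsp", 0xA0);
        NAMED.put("ndash", 0x2013);
        NAMED.put("mdash", 0x2014);
    }

    public static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int semi = -1;
            if (c == '&') {
                // Look for the closing ';' within a short window only.
                int limit = Math.min(s.length(), i + MAX_ENTITY_LENGTH);
                for (int j = i + 1; j < limit; j++) {
                    if (s.charAt(j) == ';') { semi = j; break; }
                }
            }
            Integer cp = (semi > i + 1) ? lookup(s.substring(i + 1, semi)) : null;
            if (cp != null) {
                out.appendCodePoint(cp);
                i = semi + 1;
            } else {
                out.append(c); // not a recognised reference: copy as-is
                i++;
            }
        }
        return out.toString();
    }

    // Resolves "name", "#123" or "#x7B" to a code point, or null if unknown.
    private static Integer lookup(String name) {
        if (name.startsWith("#")) {
            try {
                boolean hex = name.length() > 1
                        && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
                int cp = hex ? Integer.parseInt(name.substring(2), 16)
                             : Integer.parseInt(name.substring(1));
                return Character.isValidCodePoint(cp) ? cp : null;
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return NAMED.get(name);
    }
}
```

With this sketch, EntityDecoder.decode("&lt;br&gt;") gives back the string
"<br>", and a reference such as "&unknown;" passes through unchanged.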


Tobia

