Aurelien Pernoud wrote:
> Everything works fine, except that the entities found are always translated
> by the parser to their equivalent in the characters() method:
> &amp; becomes &
> &nbsp; becomes space
> &eacute; becomes é
> This is fine, but how do I get the ref back? In my case I must keep the
> existing references, otherwise I get errors in the generated XHTML.
As Joe mentioned, it's probably better to allow the
parser to do its job and pass the text of the entity
to the application. If you're dealing with XHTML,
then it should be the serializer's job to turn those
characters back into their entity references.
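To illustrate the serializer-side approach, here is a minimal sketch of
an escaping step that turns characters back into references on output.
The class name EntityEscaper and its tiny entity table are made up for
this example (XHTML defines many more named entities), and a real
serializer would normally do this work for you:

```java
import java.util.Map;

// Hypothetical helper, not part of Xerces or of any serializer API.
final class EntityEscaper {
    // Deliberately small table mapping code points to entity names.
    private static final Map<Integer, String> NAMED = Map.of(
            (int) '&', "amp",
            (int) '<', "lt",
            (int) '>', "gt",
            0x00A0, "nbsp",
            0x00E9, "eacute",
            0x2019, "rsquo");

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            String name = NAMED.get(cp);
            if (name != null) {
                // Known character: write a named entity reference.
                out.append('&').append(name).append(';');
            } else if (cp > 0x7E) {
                // Other non-ASCII character: fall back to a numeric reference.
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

Calling EntityEscaper.escape() on the text received in characters()
just before writing it out keeps the parser standard and leaves the
entity decisions to the output side.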
However...
If you want to know exactly what entity references
appear in the document (including character entity
refs like &nbsp;) then you can turn on a feature in
Xerces to notify the application of all entity refs.
See the following page for information on the
feature:
http://xml.apache.org/xerces2-j/features.html
But this would still pass on the characters between
the start/end entity ref calls. If you don't want
this, then you should extend the DOMParser or SAXParser
class to filter out this unwanted content.
However, realize that this would be a non-standard
way of dealing with these references.
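As a rough sketch of that filtering idea, the handler below drops the
expanded text between the start/end entity calls and writes the
reference back instead. It swaps in a different technique from the one
described above: rather than subclassing a Xerces parser class, it uses
the standard SAX2 LexicalHandler callbacks through the JDK's parser.
The class name EntityRefPreserver is made up for this example. Used
this way, only entities declared in the DTD are reported; character
references and built-in references such as &amp; still arrive as plain
characters unless the Xerces feature mentioned above is enabled, which
this sketch does not do.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: collects text, keeping declared entity refs unexpanded.
final class EntityRefPreserver extends DefaultHandler2 {
    private final StringBuilder out = new StringBuilder();
    private int entityDepth = 0;

    // "[dtd]" marks the external DTD subset, "%name" a parameter entity.
    private static boolean isGeneralEntity(String name) {
        return !name.equals("[dtd]") && !name.startsWith("%");
    }

    @Override
    public void startEntity(String name) {
        if (!isGeneralEntity(name)) return;
        // Emit the reference once, for the outermost entity only.
        if (entityDepth++ == 0) out.append('&').append(name).append(';');
    }

    @Override
    public void endEntity(String name) {
        if (isGeneralEntity(name)) entityDepth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Skip the replacement text the parser passes inside an entity.
        if (entityDepth == 0) out.append(ch, start, length);
    }

    static String textWithRefs(String xml) {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            EntityRefPreserver handler = new EntityRefPreserver();
            reader.setContentHandler(handler);
            // Standard SAX2 property for registering a LexicalHandler.
            reader.setProperty(
                    "http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

For a document whose internal DTD subset declares eacute, the text
returned by textWithRefs() should contain &eacute; rather than the
expanded character. The same caveat applies as above: this is a
non-standard way of treating the references.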
> Moreover, depending on the encoding, some entities such as &rsquo; are
> translated to "?". I've set the encoding to ISO-8859-1, and couldn't find
> which one to use to get the ’ back...
The appearance of a '?' is either a display issue
(i.e. the font doesn't have a glyph for that char)
or a serialization issue (i.e. that character cannot
be represented in the specified encoding). I'm
guessing your problem is the latter -- please use
an encoding that can represent all the Unicode
characters, like UTF-8.
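The serialization case can be reproduced without any XML at all. The
small sketch below uses only standard JDK charset calls; the class name
EncodingCheck is made up for this example. It encodes the right single
quotation mark (U+2019) in both encodings and decodes it again:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing what an encoding can and cannot represent.
final class EncodingCheck {
    // Encode to bytes and decode again with the same charset.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    // True if every character of the text has a mapping in the charset.
    static boolean canRepresent(String text, Charset cs) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String quote = "\u2019"; // right single quotation mark
        // Unmappable in ISO-8859-1, so it is replaced by '?' when encoded.
        System.out.println(roundTrip(quote, StandardCharsets.ISO_8859_1));
        // UTF-8 can encode every Unicode character, so it survives.
        System.out.println(roundTrip(quote, StandardCharsets.UTF_8).equals(quote));
    }
}
```

Note that é (U+00E9) does exist in ISO-8859-1, which would explain why
only some characters turn into '?' with that encoding.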
--
Andy Clark * [EMAIL PROTECTED]