Hey Erskine :)
I don't use JDOM to build DOM trees, but the following code works for me to
get the right thing from an HTML file I write out:
import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.apache.xml.serialize.*;

public class Crap
{
    public static void main(String[] args)
    {
        try
        {
            DocumentBuilderFactory dbf =
                DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.newDocument();

            Element root = doc.createElementNS("", "root");
            doc.appendChild(root);
            Element test = doc.createElementNS("", "test");
            root.appendChild(test);

            // Put the real characters (euro sign, registered sign) into
            // the text node, not character references.
            test.appendChild(doc.createTextNode("\u20AC,\u00AE"));

            OutputFormat of = new OutputFormat("xml", "utf-8", true);

            // Give the serializer a byte stream so it does the UTF-8
            // encoding itself; a FileWriter would use the platform
            // default charset, which may not match the declared encoding.
            try (OutputStream out = new FileOutputStream("c:/test.html"))
            {
                XMLSerializer ser = new XMLSerializer(out, of);
                ser.serialize(doc);
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
}
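The point being: send the characters through as-is. Don't worry about
escaping them yourself, as that is Xerces' job. If you put the string
"&#x20AC;" into a text node, the DOM holds those literal characters, so the
serializer is right to escape the ampersand. Note that your IDE might
print gibberish if it doesn't have Unicode-capable fonts and you redirect
to System.out rather than the file!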
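To make the difference concrete, here is a minimal sketch. It is an assumption-laden illustration: it uses only the JDK's own javax.xml.transform serializer rather than the Xerces XMLSerializer, so the exact output may not match what Xerces writes, and the class name EscapeDemo and the serialize helper are made up for the example. It serializes one text node twice, once holding the real euro character and once holding the literal string "&#x20AC;":

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EscapeDemo {
    // Hypothetical helper: builds <test>text</test> and serializes it
    // with the requested output encoding.
    static String serialize(String text, String encoding) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element test = doc.createElement("test");
        test.appendChild(doc.createTextNode(text));
        doc.appendChild(test);

        // Identity transform: copies the DOM to the output unchanged.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, encoding);
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Real euro character: US-ASCII cannot represent it, so the
        // serializer writes a numeric character reference on its own.
        System.out.println(serialize("\u20AC", "US-ASCII"));

        // Literal text "&#x20AC;": the ampersand is ordinary character
        // data here, so the serializer escapes it as &amp;.
        System.out.println(serialize("&#x20AC;", "US-ASCII"));
    }
}
```

With an output encoding that can represent the character (UTF-8, say), I would expect the first call to write the character itself instead of a reference.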
Hope this helps,
C
From: "Williams, Erskine BGI SF" <[EMAIL PROTECTED]global.com>
To: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
cc: (bcc: Constantine Georges/Towers Perrin)
Subject: xerces always escapes ampersands
Date: 08/04/2003 02:33 PM
Please respond to xerces-j-user
I'm finding that xerces is always escaping ampersands, even when they are
part of a character reference. For example, if I want to define a text
element like so: <someText>&#x20AC;</someText> (where "&#x20AC;" is the
hexadecimal entity reference for the euro "EUR" sign), when xerces writes
this out to a file, I invariably get: "<someText>&amp;#x20AC;</someText>".
Xerces is always escaping ampersands into the entity ref "&amp;".
Perhaps my confusion arises out of poor understanding of xml, but I should
think that xerces would only escape ampersands that aren't a part of a valid
entity reference, i.e., if an ampersand is immediately followed by a pound
(#) sign, it should leave it alone. Is there a more reliable way to
reference extended ascii characters in xml, so that they will pass through
xerces unmolested?
I use castor and dom4j to manipulate my xml in my application, but these
both use Xerces under the covers if I am not mistaken. Some simple test
cases are below. Any guidance is very much appreciated.
Cheers,
Erskine
/***********************
* Castor example
*
************************/
import java.io.FileWriter;
import java.io.File;
import org.exolab.castor.xml.Marshaller;

public class CastorTest {
    public static void main(String [] args) {
        //populate an arbitrary data object with special characters
        Factsheet fs = new Factsheet();
        ContentSections cs = new ContentSections();
        Content c = new Content();
        c.addPara("&#xA3; &#xA9; &#xAE;");
        cs.addContent(c);
        fs.setContentSections(cs);
        //now use the castor marshalling framework to write the data
        //object out to xml
        try {
            FileWriter fw = new FileWriter(new File("tmp.xml"));
            Marshaller m = new Marshaller(fw);
            m.setEncoding("iso-8859-1");
            m.marshal(fs);
            fw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The resulting xml file looks like:

<?xml version="1.0" encoding="iso-8859-1"?>
<factsheet>
    <content>
        <para>&amp;#xA3; &amp;#xA9; &amp;#xAE;</para>
    </content>
</factsheet>
/********************************
*
* Dom4J example
*
********************************/
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class JDomTest {
    public static void main(String [] args) {
        Document document = DocumentHelper.createDocument();
        Element root = document.addElement("root");
        Element test = root.addElement("test").addText("&#xA3;,&#xAE;");
        try {
            Writer w = new FileWriter("tmp.xml");
            document.write(w);
            w.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The result document is:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <test>&amp;#xA3;,&amp;#xAE;</test>
</root>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]