Hey Erskine :)

I don't use JDOM to build DOM trees, but the following code gives me the
right output in the HTML file it writes:

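import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.apache.xml.serialize.*;

public class Crap
{

      public static void main(String[] args)
      {
            try
            {
                  DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                  dbf.setNamespaceAware(true);

                  DocumentBuilder db = dbf.newDocumentBuilder();

                  // Build <root><test>...</test></root> with no namespace.
                  Document doc = db.newDocument();
                  Element root = doc.createElementNS(null, "root");
                  doc.appendChild(root);

                  Element test = doc.createElementNS(null, "test");
                  root.appendChild(test);

                  // Put the real characters (euro sign, registered sign) into
                  // the text node, not "&#x20AC;"-style references.
                  test.appendChild(doc.createTextNode("" + ((char)0x20AC) + "," + ((char)0xAE)));

                  // Hand the serializer a byte stream so that it applies the
                  // encoding named in the OutputFormat; a FileWriter would use
                  // the platform default encoding instead.
                  OutputFormat of = new OutputFormat("xml", "utf-8", true);
                  OutputStream out = new FileOutputStream("c:/test.html");
                  try
                  {
                        XMLSerializer ser = new XMLSerializer(out, of);
                        ser.serialize(doc);
                  }
                  finally
                  {
                        out.close();
                  }
            }
            catch(Exception e)
            {
                  e.printStackTrace();
            }
      }
}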

The point being: send the characters through as-is.  Don't worry about
escaping them yourself, as that is Xerces' job.  Note that your IDE might
print gibberish if it doesn't have Unicode-capable fonts and you redirect
to System.out rather than the file!
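
To make that concrete, here is a minimal sketch that uses only the JAXP
classes bundled with the JDK rather than the Xerces XMLSerializer above, so
treat it as an illustration of the principle and not of Xerces' exact output.
The class name CharRefDemo and the helpers serialize() and parseText() are
made up for this example. It puts real characters into a text node, writes
the document in US-ASCII (which cannot represent them directly), and then
parses the result back. It also shows what happens when the text of a
character reference is passed in as a string.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class CharRefDemo {

    // Hypothetical helper: wraps the given text in <test>...</test> and
    // serializes the document using the requested output encoding.
    static String serialize(String text, String encoding) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element test = doc.createElement("test");
        test.appendChild(doc.createTextNode(text));
        doc.appendChild(test);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, encoding);
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString(encoding);
    }

    // Hypothetical helper: parses an XML string and returns the text
    // content of its root element.
    static String parseText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Real characters in: the serializer has to turn them into numeric
        // character references, because US-ASCII cannot encode them.
        String written = serialize("\u20AC,\u00AE", "US-ASCII");
        System.out.println(written);

        // Parsing the output gives the original two characters back.
        System.out.println(parseText(written).equals("\u20AC,\u00AE"));

        // The *text* "&#x20AC;" in: the ampersand is ordinary character
        // data, so it is escaped as &amp; on the way out.
        System.out.println(serialize("&#x20AC;", "US-ASCII"));
    }
}
```

The last call reproduces the behaviour in the original question: once a
string reaches the DOM it is character data, and an ampersand in character
data always has to be escaped.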

Hope this helps,
C



                                                                                
                                                                
                                                                                
                                                                
"Williams, Erskine BGI SF" <[EMAIL PROTECTED]global.com>
08/04/2003 02:33 PM
Please respond to xerces-j-user

To:       "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
cc:       (bcc: Constantine Georges/Towers Perrin)
Subject:  xerces always escapes ampersands
                                                                                
                                                                
                                                                                
                                                                




I'm finding that xerces is always escaping ampersands, even when they are
part of a character reference. For example, if I want to define a text
element like so: <someText>&#x20AC;</someText> (where "&#x20AC;" is the
hexadecimal character reference for the euro "EUR" sign), then when xerces
writes this out to a file, I invariably get:
"<someText>&amp;#x20AC;</someText>". Xerces is always escaping ampersands
into the entity ref "&amp;".

Perhaps my confusion arises out of poor understanding of xml, but I should
think that xerces would only escape ampersands that aren't a part of a
valid entity reference, i.e., if an ampersand is immediately followed by a
pound (#) sign, it should leave it alone. Is there a more reliable way to
reference extended ascii characters in xml, so that they will pass through
xerces unmolested?

I use castor and dom4j to manipulate my xml in my application, but these
both use Xerces under the covers if I am not mistaken. Some simple test
cases are below. Any guidance is very much appreciated.
Cheers,
Erskine

/***********************
* Castor example
*
************************/
import java.io.FileWriter;
import java.io.File;

import org.exolab.castor.xml.Marshaller;

public class CastorTest {

  public static void main(String [] args) {

    //populate an arbitrary data object with special characters
    Factsheet fs = new Factsheet();
    ContentSections cs = new ContentSections();
    Content c = new Content();
    c.addPara("&#xA3; &#xA9; &#xAE;");
    cs.addContent(c);
    fs.setContentSections(cs);

    //now use the castor marshalling framework to write the data object out to xml
    try {
      FileWriter fw = new FileWriter(new File("tmp.xml"));
      Marshaller m = new Marshaller(fw);
      m.setEncoding("iso-8859-1");
      m.marshal(fs);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

The resulting xml file looks like:

<?xml version="1.0" encoding="iso-8859-1"?>
<factsheet>
  <content>
    <para>&amp;#xA3; &amp;#xA9; &amp;#xAE;</para>
  </content>
</factsheet>

/********************************
*
* Dom4J example
*
********************************/
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class JDomTest {

  public static void main(String [] args) {
    Document document = DocumentHelper.createDocument();
    Element root = document.addElement("root");
    Element test = root.addElement("test").addText("&#xA3;,&#xAE;");
    try {
      Writer w = new FileWriter("tmp.xml");
      document.write(w);
      w.close();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

The result document is:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <test>&amp;#xA3;,&amp;#xAE;</test>
</root>


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






