Hi all,
  I've been struggling and bugging folks here for a few weeks now with
trying to get some i18n stuff working with tomcat and xerces talking to 
an Oracle backend. I finally have it all working and here's the subsequent
brain dump:

The architecture:

We're implementing the CNRP protocol (http://cnrp.net), which is an XML-based
protocol over HTTP. The setup has an externally available Apache 1.3
server with mod_jk from Tomcat 3.3 milestone 1 talking via Ajp12 to
a Tomcat 3.3-M1 installation. I attempted to use Ajp13 but it would
stop giving the servlet a body after the 3rd transaction (the getReader()
call would succeed but nothing would be there). I have one servlet
that takes the XML query that's in the body of the POSTed request, parses
it and then builds a SQL query to an Oracle 8.1.6 server.

This is where things got tricky. The APIs in Xerces make you think
that you'll get a properly converted string out but you don't.
Node.getNodeValue() gives you a String whose chars are still the raw
UTF-8 encoded bytes, one char per byte! You have to do this to get 'em
into a real Java String:

   String newCN = new String(query.getCommonName().getBytes(), "UTF-8");

As I clean this up I'll probably wrap all of my Xerces calls that
return String values with this:

    String foo = new String(node.getNodeValue().getBytes(), "UTF-8");
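
One caveat on the wrapper above: getBytes() with no argument uses the
platform default encoding, so the trick only works where that default maps
each char back to the same single byte (ISO-8859-1 does). A minimal sketch
of a more explicit version follows. The class and method names are made up
for illustration, it assumes Java 7 or later for StandardCharsets, and it
assumes the mis-decoding really was ISO-8859-1, which I have not confirmed
for every setup.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Xerces or of the servlet described here.
final class Utf8Redecode {
    private Utf8Redecode() {}

    // Assumption: every char in 'misdecoded' holds one raw UTF-8 byte (0-255),
    // which is what you get when UTF-8 input was read as ISO-8859-1.
    static String fix(String misdecoded) {
        // ISO-8859-1 maps chars 0-255 straight back to the same byte values.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Now decode those bytes the way they should have been decoded.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) is the two UTF-8 bytes 0xC3 0xA9.
        String broken = "\u00C3\u00A9";
        System.out.println(fix(broken).equals("\u00E9"));
    }
}
```

Naming both encodings keeps the result independent of whatever default
the JVM happens to start with.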

We then take this value and build a SQL query the standard way:

  String querystring = "select * from cnrp where cn LIKE '" + newCN + "%'";

Be careful here: the match semantics are governed by the various NLS
parameters in Oracle, such as the language and country, so set them
correctly if you want your ordering and matching to be accurate.
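
Since the query string above is built by concatenating a value that came
from the request, a quote character in the common name will break it (or
worse). A hedged sketch of the same prefix match using a JDBC
PreparedStatement is below. The class and method names are hypothetical,
the table and column names are the ones from the query above, and the
choice of backslash as the LIKE escape character is my assumption.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper; only the table and column names come from the post.
final class CnrpPrefixQuery {
    private CnrpPrefixQuery() {}

    // Escapes the LIKE wildcards in the user's text so they match literally,
    // then appends '%' to make it a prefix match.
    static String likePrefix(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '%' || c == '_') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append('%').toString();
    }

    // Binds the pattern as a parameter instead of pasting it into the SQL.
    // The caller is responsible for closing the returned statement.
    static PreparedStatement prepare(Connection conn, String commonName)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "select * from cnrp where cn LIKE ? ESCAPE '\\'");
        ps.setString(1, likePrefix(commonName));
        return ps;
    }

    public static void main(String[] args) {
        System.out.println(likePrefix("O'Brien_50%"));
    }
}
```

The NLS caveat still applies: binding the parameter changes how the value
reaches the server, not how Oracle compares it.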

Then we build the results XML document from the SQL ResultSet. This is
fairly straightforward as the SQL ResultSet and the Oracle driver handle
the UTF-8 to UCS-2 conversion needed between the database server and
the servlet. If you aren't using the Oracle drivers then you have to
find out whether yours handle conversions between the database character
set and the JVM character encoding. The real issue is when you serialize
the document out. This is what we're doing, since the standard says the
document is in UTF-8. 'out' here is the ServletOutputStream that we got
from the servlet's response object.

        OutputFormat format = new OutputFormat(document);
        format.setEncoding("UTF-8");
        XMLSerializer serializer = new XMLSerializer(out, format);
        serializer.serialize(document.getDocumentElement());
        out.flush();
        out.close();

Before you do this though you have to set the response's encoding like this:
        response.setContentType("text/xml; charset=UTF-8");
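
If you want to check the serialization step outside the servlet, here is a
rough sketch that writes a DOM document as UTF-8 bytes and parses it back.
Note that it swaps in the standard JAXP Transformer API instead of the
Xerces XMLSerializer used above, so it is not what we are running. The
class name, method names and the "commonname" element are made up for the
example. In the servlet you would pass the response's output stream to
write() after setting the content type as shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper using JAXP, not the Xerces XMLSerializer from the post.
final class Utf8XmlWriter {
    private Utf8XmlWriter() {}

    // Serializes the document to a byte stream, asking for UTF-8 output.
    static void write(Document doc, OutputStream out) throws TransformerException {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.transform(new DOMSource(doc), new StreamResult(out));
    }

    // Builds a one-element document, writes it, parses the bytes back and
    // returns the text the parser recovered.
    static String roundTrip(String text) {
        try {
            DocumentBuilder db =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.newDocument();
            Element cn = doc.createElement("commonname"); // made-up element name
            cn.appendChild(doc.createTextNode(text));
            doc.appendChild(cn);

            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            write(doc, buf);

            Document back = db.parse(new ByteArrayInputStream(buf.toByteArray()));
            return back.getDocumentElement().getFirstChild().getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("caf\u00E9").equals("caf\u00E9"));
    }
}
```

Writing to a byte stream rather than a Writer leaves the character
encoding to the serializer, so the declared encoding and the actual bytes
stay in agreement.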

The things I've learned from all of this are:

1) the hardest part of dealing with Unicode is when APIs intimate
that they 'handle Unicode' but are never clear about what encodings
they can handle. 

2) APIs that do encoding conversions for you should be very clear about
exactly what happens where. If something can do a conversion for the
caller, it should say plainly whether it is converting the values into
the encoding specified or out of the encoding specified.

3) Oracle and the JDBC APIs are actually doing the right thing even though
they don't document it very well.

4) The Xerces documentation should be updated to either be clear about
not doing conversions or it should provide APIs that do the
conversion.


-MM

-- 
--------------------------------------------------------------------------------
Michael Mealling        |      Vote Libertarian!       | www.rwhois.net/michael
Sr. Research Engineer   |   www.ga.lp.org/gwinnett     | ICQ#:         14198821
Network Solutions       |          www.lp.org          |  [EMAIL PROTECTED]
