Re: Encoding scheme with DOMWriter

Neil Graham Fri, 21 Mar 2003 15:16:28 -0800

Hi Elliot,

I'd defer to PeiYong on this one, since he's done most of the serialization
work.  But my thought is that the rawBuffer would be in whatever encoding
you set on the Document that you serialized into the MemBufFormatTarget (or
the encoding of the original document of course if you haven't touched the
encoding explicitly).  So if you can determine what the native encoding is,
then you should be able to induce the rawBuffer to reflect that encoding.


I am curious about one thing though:  If you're producing XML, then why do
you care what encoding the byte stream is in?  If you're producing XML for
another application then it follows that application must be XML-aware; if
so, it must understand UTF-8 and UTF-16--at least that's what the XML spec
implies.  If it understands UTF-16, then the original array of XMLChars
that writeToString handed to you should be enough...
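
To illustrate why the UTF-16 array is already a usable starting point: turning UTF-16 code units into UTF-8 bytes is a purely mechanical step that needs no code page tables. What follows is only a minimal sketch in standard C++, not Xerces code. The helper name utf16ToUtf8 is hypothetical, and the sketch assumes the code units have been copied into a std::u16string.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Xerces-C): encode UTF-16 code units as UTF-8.
// Assumption: the input has already been copied into a std::u16string.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 >= in.size()) {
                throw std::invalid_argument("truncated surrogate pair");
            }
            const std::uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        if (cp < 0x80) {            // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes: code points above U+FFFF
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For ASCII-only markup such as u"<a/>" the result is the same four ASCII bytes; a character such as U+00E9 becomes the two bytes 0xC3 0xA9. A native code page, by contrast, may have no representation at all for some characters, which is part of why the target encoding matters.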

Cheers!
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  [EMAIL PROTECTED]




From:              [EMAIL PROTECTED]
Date:              03/21/2003 03:27 PM
Please respond to: xerces-c-dev
To:                [EMAIL PROTECTED]
cc:
Subject:           Re: Encoding scheme with DOMWriter




Neil,

Thanks for the help.  That was definitely it.  Now my question is whether
this is what I should be doing.  Since I am writing a wrapper around this
functionality, I need to be able to serialize a node as text and deliver it
as a modified PStr (with a 4-byte length instead of a 1-byte one) in the
native code page, because it will eventually make its way to a different
primary application.  So, whatever platform I happen to be on, I should be
able to give the user the "xml" associated with a given node.

So, getRawBuffer() returns the internal raw buffer, but I assume that this
is not necessarily in the native code page.
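
The packing step itself is independent of the encoding question. Here is a minimal sketch in standard C++ of wrapping a byte buffer in a 4-byte length prefix. The name makeLongPStr is hypothetical, and the big-endian byte order for the length is purely an assumption, since the layout the receiving application expects is not stated here. The bytes and their count would come from wherever the serialized text lives, for instance the raw buffer and length reported by the MemBufFormatTarget.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: build a Pascal-style string with a 4-byte length prefix.
// Assumption: the length is stored big-endian; adjust to the consumer's layout.
std::vector<unsigned char> makeLongPStr(const unsigned char* data, std::size_t len) {
    if (static_cast<std::uint64_t>(len) > 0xFFFFFFFFull) {
        throw std::length_error("payload does not fit in a 4-byte length");
    }
    const std::uint32_t n = static_cast<std::uint32_t>(len);

    std::vector<unsigned char> out;
    out.reserve(4 + len);
    // Length prefix, most significant byte first.
    out.push_back(static_cast<unsigned char>((n >> 24) & 0xFF));
    out.push_back(static_cast<unsigned char>((n >> 16) & 0xFF));
    out.push_back(static_cast<unsigned char>((n >> 8) & 0xFF));
    out.push_back(static_cast<unsigned char>(n & 0xFF));
    // Payload bytes, copied unchanged (no transcoding happens here).
    if (len > 0) {
        out.insert(out.end(), data, data + len);
    }
    return out;
}
```

Note that this only frames the bytes it is given; whether those bytes are in the native code page still depends on the encoding the serializer was asked to produce.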

Any thoughts?

Thanks.



From:              "Neil Graham" <[EMAIL PROTECTED]>
Date:              03/20/2003 04:52 PM
Please respond to: xerces-c-dev
To:                [EMAIL PROTECTED]
cc:
Subject:           Re: Encoding scheme with DOMWriter





Hi Elliot,

The last two lines of your code seem to be the most interesting:

            XMLCh* tempxmltext=theSerializer->writeToString(*m_Node);

The documentation of the writeToString method [1] quite clearly states that
the output will be in UTF-16 and that the document's encoding will be
ignored; so that's why you're seeing UTF-16 output.

      char* xmlchartext = XMLString::transcode(tempxmltext);

And this is transcoding your text for you, but the previous step already
overwrote the encoding information; that is, encoding="UTF-16" is already
there and this is just transcoding that string, like any other.

If you want a sequence of bytes in memory, why not use the DOMWriter's
writeNode method to write to a MemBufFormatTarget, then use getRawBuffer
to get yourself an array of XMLBytes (XMLByte is typedef'd to unsigned char)?
Hopefully that'll meet whatever need is compelling you to want a UTF-8 char
array in memory.

Hope that helps,
Neil

[1]:  http://xml.apache.org/xerces-c/apiDocs/classDOMWriter.html#z272_12

Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  [EMAIL PROTECTED]




From:              [EMAIL PROTECTED]
Date:              03/20/2003 05:10 PM
Please respond to: xerces-c-dev
To:                [EMAIL PROTECTED]
cc:
Subject:           Encoding scheme with DOMWriter






I am having trouble with the writeToString function of the DOMWriter.
Basically, when I load an xml file using XercesDOMParser parse(path), and
then print the information using writeToString, the decl shows up with the
wrong encoding.  The odd thing is that the DOMDocument getEncoding()
function indicates the correct encoding shown in the document.

For example, in the DOMPrint example, the following:

<?xml version="1.0" encoding="UTF-8"?>
<personnel xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation='personal.xsd'>

shows up as:

<?xml version="1.0" encoding="UTF-16" standalone="no" ?><personnel
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="personal.xsd">

Am I doing something wrong?  Based on the DOMPrint example, it seems as
though I shouldn't have to explicitly set the encoding if it is called out
in the document itself.

The code is a little convoluted as I am building a wrapper dll around
xerces to use it in another program, but here is the part that writes the
xml to string:

      XMLCh tempStr[100];
      XMLString::transcode("LS", tempStr, 99);
      DOMImplementation *impl =
          DOMImplementationRegistry::getDOMImplementation(tempStr);
      DOMWriter *theSerializer =
          ((DOMImplementationLS*)impl)->createDOMWriter();
      char *outputencoding =
          XMLString::transcode(theSerializer->getEncoding());
      XMLCh* tempxmltext = theSerializer->writeToString(*m_Node);
      char* xmlchartext = XMLString::transcode(tempxmltext);

Thanks!



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




