RE: Maximum Multi-byte string (Japanese) Length

David N Bertoni/Cambridge/IBM 14 Mar 2003 22:19:56 -0000


Hi Steve,

> Our product generates a XML file with the encoding="Shift_JIS".
> The string I'm interested in does in fact appear, in Japanese, in this
> file.

So far, so good, as long as the version of Xerces you're using has support
for Shift-JIS.  I don't think that's true out-of-the-box, except on a
Windows machine which happens to have support for Shift-JIS installed.  I
recall versions of NT 4.0 that didn't, but Windows 2K seems to have it by
default.

> Then during the XSLT transformation, we call the xFunction described
> below.  The function takes the UTF-16 string (as you said), changes it
> to UNICODE (by calling TranscodeToLocalCodePage()), processes it, and
> changes it back to UNICODE (by calling TranscodeFromLocalCodePage()).
> I don't think what the function does is important, as I'm losing the
> string immediately in line 1 (args[0]->str()).

Two things:

   1. How does this function get the UTF-16 string from the file?  Do you
   parse it with Xerces?  Is the parse successful?  As I said before,
   Xerces may or may not have intrinsic support for Shift-JIS.

   2. You should _never_ transcode a string to the local code page if you
   are concerned about inter-operability.  There is no guarantee you can
   actually do it, and worse, most local code page conversion functions do
   not report an error if they can't transcode the string -- they just stop
   at the first character they cannot represent.  Also, depending on the
   particular compiler and version, the OS version, and the current C++
   locale, you may get different results.  Transcoding back is even more
   difficult, because you may lose characters in the conversion to the
   local code page, so you cannot guarantee the string will come back in
   its original form.
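
To make the second point concrete, here is a minimal sketch of how a silent
truncation destroys the round trip.  Note that toLocalCodePage() and
fromLocalCodePage() are hypothetical stand-ins I made up for illustration,
not the Xalan or Xerces routines, and the "local code page" is assumed to
be plain 7-bit ASCII just to keep the failure easy to see:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a local-code-page transcoder.  The "code page"
// here is 7-bit ASCII, an assumption made only for illustration.  Like many
// real conversion routines, it stops silently at the first character it
// cannot represent instead of reporting an error.
std::string toLocalCodePage(const std::u16string& utf16)
{
    std::string local;
    for (char16_t ch : utf16)
    {
        if (ch > 0x7F)
            break;  // silent truncation, no error reported
        local.push_back(static_cast<char>(ch));
    }
    return local;
}

// Widening back is lossless for ASCII, but it cannot restore anything that
// was dropped on the way in.
std::u16string fromLocalCodePage(const std::string& local)
{
    std::u16string utf16;
    for (unsigned char ch : local)
    {
        assert(ch <= 0x7F);  // input must already be in the toy code page
        utf16.push_back(static_cast<char16_t>(ch));
    }
    return utf16;
}

// True only if the string survives the trip through the local code page.
bool survivesRoundTrip(const std::u16string& utf16)
{
    return fromLocalCodePage(toLocalCodePage(utf16)) == utf16;
}
```

With this toy code page, a string such as u"abc\u65E5\u672Cdef" comes back
as just "abc", and nothing tells the caller that anything was lost.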

> When you say it is likely a problem with the local code page conversion,
> do you mean XAlan has trouble doing the conversion of the string from
> Shift_JIS to UTF-16?

Xalan uses whatever code page conversion routines that Xerces uses, because
there's no way to know from platform to platform what is supported.
Windows has fewer problems because it supports Unicode natively and ships
with more robust code page support.  Most of the Unix platforms do not.

> When I experiment with the JA message str value in our product(prior to
> XML generation), I observed that if I make the string less than 31
> Japanese characters, it works.  If the string is 31 or more characters,
> we lose it (in my example, it happens to work out that 2 bytes occupy
> each Japanese character...so, for example, the string I see in my
> English VC++ editor shows 62 bytes.  If I delete 2 bytes, then I would
> have 60 visible characters, representing 30 Japanese characters, and
> this string works).

I don't know of any particular limitation within Xalan on local code page
transcoding with regard to the length of the string.  It may have something
to do with the particular characters at that position in the string.  You
might test by using some dummy strings of various lengths composed of one
character that you know is processed correctly.
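
Such a test could be sketched roughly as follows.  The function name
firstFailingLength() and the roundTrip callback are hypothetical; you would
plug in whatever conversion sequence your extension function really
performs:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Builds strings of `ch` repeated minLen..maxLen times and returns the
// smallest length whose round trip does not reproduce the input, or 0 if
// every length in the range survives.  `roundTrip` is a caller-supplied
// (hypothetical) wrapper around the conversion being tested.
std::size_t firstFailingLength(
    const std::function<std::u16string(const std::u16string&)>& roundTrip,
    char16_t ch,
    std::size_t minLen,
    std::size_t maxLen)
{
    assert(minLen >= 1);  // 0 is reserved to mean "no failure found"
    for (std::size_t len = minLen; len <= maxLen; ++len)
    {
        const std::u16string input(len, ch);
        if (roundTrip(input) != input)
            return len;
    }
    return 0;
}
```

If the failing length stays the same no matter which known-good character
you use, the length really is the problem.  If it moves or disappears, the
particular characters at that position are the more likely culprit.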

For true interoperability, you should consider transcoding local strings
into UTF-16 to operate on them.  Using the local code page is a time-bomb
waiting to happen.  What happens when you start deploying an application on
a machine running an English version of HP-UX that needs to process a
Shift-JIS file?
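
As a rough illustration of operating on the UTF-16 data directly, here is a
sketch that counts characters without ever touching the local code page.
It uses std::u16string rather than the Xalan string class, and
countCodePoints() is a name I made up for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 string.  A well-formed surrogate
// pair (high unit followed by low unit) counts as one character; any other
// code unit, including an unpaired surrogate, counts as one.
std::size_t countCodePoints(const std::u16string& utf16)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf16.size(); ++i)
    {
        const bool isHigh = utf16[i] >= 0xD800 && utf16[i] <= 0xDBFF;
        const bool nextIsLow = i + 1 < utf16.size() &&
                               utf16[i + 1] >= 0xDC00 &&
                               utf16[i + 1] <= 0xDFFF;
        if (isHigh && nextIsLow)
            ++i;  // skip the low half of the pair
        ++count;
    }
    // Every code point occupies one or two code units.
    assert(count <= utf16.size());
    return count;
}
```

The result does not depend on the compiler, the OS version, or the current
C++ locale, which is the whole point of staying in UTF-16.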

If you can reproduce this with a small sample piece of code and some
minimal inputs I'd be interested in taking a look at it.

By the way, please make sure your replies are sent to the list.  If you
just reply, the message is sent only to me.

Hope that helps...

Dave