Re: XMLCh & wchar_t conversion on multiple platforms

Andy Heninger Mon, 14 May 2001 14:23:53 -0700

Here is a proposal for wchar_t conversions from Markus Scherer on the ICU mailing list.

From: "Markus Scherer" <[EMAIL PROTECTED]>
To: "icu list" <[EMAIL PROTECTED]>
Sent: Friday, May 11, 2001 1:32 PM
Subject: icu api proposal: in-process string transformations UChar*<->UTF-8/32/wchar_t*

This is a kind of FAQ:
"ICU processes strings in UTF-16, but my XYZ API uses UTF-8/32/wchar_t*. What do I do?"

This is especially interesting because the UTF transformations are trivial and fast, and because the wchar_t transformation on many platforms today is just a UTF transformation. Providing functions that portably perform these commonly requested transformations and do the legwork when wchar_t is not Unicode seems like a useful feature.

I propose the following 6 functions:

wchar_t *u_strToWCS(wchar_t *dest, int32_t destCapacity,
                    int32_t *pDestLength,
                    const UChar *src, int32_t srcLength,
                    UErrorCode *pErrorCode);

UChar *u_strFromWCS(UChar *dest, ...);

uint8_t *u_strToUTF8(uint8_t *dest, ...);
UChar *u_strFromUTF8(UChar *dest, ...);

uint32_t *u_strToUTF32(uint32_t *dest, ...);
UChar *u_strFromUTF32(UChar *dest, ...);

I propose that these functions not be part of the converter API. They work on process-internal string encodings, while converters are designed for external encodings. There is no buffer management here, and the UTF transformations will use our UTF macros.

Details of semantics:
- The functions always write a NUL termination if destCapacity is sufficient.
- If srcLength==-1 then u_strlen(src) is used as usual. In this case, if there is not enough destCapacity for the NUL, then a U_BUFFER_OVERFLOW_ERROR is set.
- If srcLength>=0 and only the NUL does not fit, then no error code is set.
- If any character except for the automatic NUL does not fit, then a U_BUFFER_OVERFLOW_ERROR is always set.
- All functions always write to the dest buffer.
Note that this would not be necessary when wchar_t carries UTF-16 anyway as on Win32. However, for consistent behavior, the WCS functions will still memcpy().
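
The termination and overflow rules above can be sketched for the simplest case the note mentions, where the destination is also UTF-16 and the work is a memcpy(). This is a minimal illustration, not the proposed implementation: sketchCopyUTF16, sketchStrlen, and the SketchError enum are hypothetical names standing in for the proposed function, u_strlen, and UErrorCode. The proposal does not say what *pDestLength reports or how much is written on overflow; the sketch assumes the full required length is reported and that as many units as fit are copied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for UErrorCode; only the two values needed here.
enum SketchError { SKETCH_OK = 0, SKETCH_BUFFER_OVERFLOW = 1 };

// Length of a NUL-terminated UTF-16 string in 16-bit units (the role of u_strlen).
static int32_t sketchStrlen(const char16_t *s) {
    int32_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Copy UTF-16 to a UTF-16 destination, applying the termination rules listed above.
char16_t *sketchCopyUTF16(char16_t *dest, int32_t destCapacity,
                          int32_t *pDestLength,
                          const char16_t *src, int32_t srcLength,
                          SketchError *pErrorCode) {
    assert(pErrorCode != nullptr);

    const bool implicitLength = (srcLength == -1);
    const int32_t length = implicitLength ? sketchStrlen(src) : srcLength;

    // Assumption: report the full length needed, even when it does not fit.
    if (pDestLength != nullptr) *pDestLength = length;

    // Assumption: on overflow, still copy as many units as fit.
    const int32_t toCopy = length < destCapacity ? length : destCapacity;
    if (toCopy > 0) {
        std::memcpy(dest, src, static_cast<std::size_t>(toCopy) * sizeof(char16_t));
    }

    if (length < destCapacity) {
        dest[length] = 0;                       // room for the NUL: always write it
    } else if (length > destCapacity || implicitLength) {
        *pErrorCode = SKETCH_BUFFER_OVERFLOW;   // a character, or a required NUL, did not fit
    }
    // length == destCapacity with an explicit srcLength: no NUL, no error.
    return dest;
}
```

With a 3-unit source, a capacity of 4 yields a terminated copy; a capacity of 3 succeeds without a NUL when srcLength is 3 but reports overflow when srcLength is -1; a capacity of 2 reports overflow either way.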

Expiration: Friday, 2001-may-17

markus
_______________________________________________
icu mailing list
[EMAIL PROTECTED]
http://oss.software.ibm.com/developerworks/opensource/mailman/listinfo/icu

There's a bit of discussion on the topic going on over there; follow the links above if you are interested. In the API proposal, UChar is a 16-bit UTF-16 code unit, and thus would be completely interoperable with XMLCh.
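
To make the interoperability point concrete, here is a rough C++ sketch of moving a UTF-16 buffer (the representation shared by UChar and XMLCh) into a std::wstring. It assumes, as the rest of this thread cautions is not universally true, that a 2-byte wchar_t holds UTF-16 and a 4-byte wchar_t holds UCS-4/UTF-32. sketchUtf16ToWide is a hypothetical helper name, not an ICU or Xerces function, and char16_t stands in for XMLCh.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert `length` UTF-16 code units to a wide string.
// Assumption: 2-byte wchar_t is UTF-16, 4-byte wchar_t is UCS-4/UTF-32.
std::wstring sketchUtf16ToWide(const char16_t *src, std::size_t length) {
    assert(src != nullptr || length == 0);

    std::wstring out;
    out.reserve(length);
    for (std::size_t i = 0; i < length; ++i) {
        char32_t c = src[i];
        if (sizeof(wchar_t) >= 4) {
            // Wide enough for any code point: combine a lead surrogate with
            // the trail surrogate that follows it. Unpaired surrogates are
            // passed through unchanged.
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < length &&
                src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
                c = 0x10000 + ((c - 0xD800) << 10) + (src[i + 1] - 0xDC00);
                ++i;
            }
        }
        // With a 2-byte wchar_t this is a plain unit-for-unit copy.
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}
```

On a platform where wchar_t uses some other encoding, neither branch is correct and a real converter (iconv or an ICU converter) would still be needed.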

Andy Heninger
IBM, Cupertino, CA
[EMAIL PROTECTED]

----- Original Message -----
From: "Andy Heninger" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 02, 2001 4:55 PM
Subject: Re: XMLCh & wchar_t conversion on multiple platforms

> wchar_t is messy. For the platforms you mentioned, if sizeof(wchar_t) ==
> 2 wchar_t will be utf-16. If the size is 4 bytes and __STDC_ISO_10646__
> is defined, wchar_t is UCS4. I think. But this definitely does not cover
> all possible platforms.
>
> If you know that your Unicode data has no code points > 64k, you can do a
> quick and dirty conversion to UCS4 by just unpacking the 16 bit values
> into 32 bits, with the hi bytes being zero.
>
> You'd think that there would be simple to use library functions for
> converting to/from wchar_t, but there don't seem to be. I'm lobbying to
> get one added to ICU.
>
> Andy Heninger
> IBM, Cupertino, CA
> [EMAIL PROTECTED]
>
>
> ----- Original Message -----
> From: "Mark A Russell" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Tuesday, May 01, 2001 12:00 PM
> Subject: RE: XMLCh & wchar_t conversion on multiple platforms
>
>
> > So am I correct then in assuming that I will need to instantiate a
> > transcoder of type ICU or Iconv just to do the conversion? If this is the
> > case then what are the encodingName's that the constructors take, the
> > UConverter that ICU takes, and the block size that Iconv takes?
> >
> > Is there some sample code out there that gives a simple case of how this
> > works?
> > Also how do you go about determining wchar_t format? (Beyond just using
> > #ifdef's )
> >
> > Thanks,
> >
> > Mark R
> >
> > -----Original Message-----
> > From: Andy Heninger [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, May 01, 2001 10:31 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: XMLCh & wchar_t conversion on multiple platforms
> >
> >
> > wchar_t seems to be perpetually awkward, largely because its definition
> > varies so much from platform to platform. You will end up with some
> > platform specific code to find the local wchar_t format. Once you have
> > that you can use either iconv (UNIXes), ICU converters (all platforms,
> > assuming you have ICU around), or nothing (when wchar_t encoding is
> > utf-16) to get from utf-16 encoded XMLCh strings to wchar_t strings.
> >
> >
> > Andy Heninger
> > IBM, Cupertino, CA
> > [EMAIL PROTECTED]
> >
> > ----- Original Message -----
> > From: "Mark A Russell" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Tuesday, May 01, 2001 6:50 AM
> > Subject: RE: XMLCh & wchar_t conversion on multiple platforms
> >
> >
> > > That seems to be the issue I'm running into, but I can't seem to figure
> > > out how to do the transcoding. I've looked through the docs, and more
> > > importantly the headers and the closest thing I can find is the
> > > transcodeTo and transcodeFrom functions. The issue I have with those is
> > > that you have to determine which Transcoder to use, ie Iconv or ICU, you
> > > have to know the unicode type when you instantiate the transcoder, and
> > > also they are not static functions. Meaning I have to instantiate a
> > > transcoder just to do some conversions.
> > >
> > > Surely there is a simpler way to do the transcoding?
> > >
> > > Mark A Russell
> > > NextGen Software Engineer
> > > CSG Systems, Inc.
> > > E-Mail: [EMAIL PROTECTED]
> > >
> > >
> > > -----Original Message-----
> > > From: Dean Roddey [mailto:[EMAIL PROTECTED]]
> > > Sent: Monday, April 30, 2001 4:44 PM
> > > To: '[EMAIL PROTECTED]'
> > > Subject: RE: XMLCh & wchar_t conversion on multiple platforms
> > >
> > >
> > > A decision was made a while back, which I didn't really agree with, to
> > > fix XMLCh to UTF-16 on all platforms. Partly this was because the DOM
> > > committee chose UTF-16 for its representation. So, if this is not
> > > compatible with your wchar_t, you must transcode all of the data to your
> > > local wide string representation before using it. On NT, the stuff spit
> > > out from the parser is directly useable, since UTF-16 is NT's native
> > > representation of Unicode. On other platforms, you'll have to transcode
> > > if they don't do the same.
> > >
> > > --------------
> > > Dean Roddey
> > > Software Geek Extraordinaire
> > > Portal, Inc
> > > [EMAIL PROTECTED]
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Mark A Russell [mailto:[EMAIL PROTECTED]]
> > > Sent: Monday, April 30, 2001 3:25 PM
> > > To: [EMAIL PROTECTED]
> > > Subject: XMLCh & wchar_t conversion on multiple platforms
> > >
> > >
> > > Is there a way to convert between XMLCh and wchar_t on both the AIX 4.3
> > > & Solaris platform that won't break my code on NT?
> > >
> > > I have some code that I'm trying to port from win32 that uses wchar_t
> > > for unicode support. This code currently makes use of some of the xerces
> > > functions that only take XMLCh's. An example is shown below:
> > >
> > > const wchar_t * szSourceBinding =
> > > attributes.getValue(CBOITagFactory::ATTR_SOURCE_BINDING);
> > >
> > > The CBOITagFactory::ATTR_SOURCE_BINDING is simply a wchar_t. (XMLCh's
> > > are currently unsigned shorts)
> > >
> > > My requirement is to maintain unicode support on all three platforms. I
> > > thought about just redefining XMLCh's to wchar_t's like they used to be
> > > around 1.2, however after looking at the documentation that seems like a
> > > very bad idea because of an incompatibility that would arise on the
> > > Solaris platform.
> > >
> > > Any help would be much appreciated.
> > >
> > > btw - What happened to the mailing list archives? They seem to be
> > > unreachable.
> > >
> > > Mark A Russell
> > > NextGen Software Engineer
> > > CSG Systems, Inc.
> > > E-Mail: [EMAIL PROTECTED]
> > >
> > >
