Thanks for replying.
I managed to work this out. It seems that the gSOAP framework was
converting the 8-bit UTF-8 characters into 7-bit strings, i.e. Ä, causing
Xalan to fail to load the document. The solution was to set the following in
gSOAP:
soap_init2(&soap, SOAP_C_UTFSTRING,SOAP_C_UTFSTRING);
This stops gSOAP from altering 8-bit characters.
Cheers
Paul
----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, June 17, 2003 4:46 PM
Subject: Re: Xalan-C++ and UTF-8 with non ascii characters
>
> > Hi,
> > I am using Xalan-C++ to perform XPath queries on an XML document. All
> > works fine, except that some non-ASCII characters encoded as UTF-8 cause
> > an exception in theliaison->parseXMLStream();
>
> I suggest you catch the exception and take a look at the error message.
> Without that, it will be impossible to diagnose the problem. Start with
> catching SAXParseException, because that's probably what's being thrown.
>
> > An example of a problematic character is the German umlaut. The XML is
> > transported over HTTP/SOAP from a VB application to Xalan-C++ using gSOAP.
> > Looking at the encoding of the umlaut character shows it is sent from VB
> > as two bytes: (hex) C3 84 - (decimal) 195 132
>
> The two bytes C3 84 in UTF-8 encode the Unicode character U+00C4, Latin
> Capital Letter A With Diaeresis, or capital A with an umlaut. Is that the
> character you're expecting?
>
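To make the C3 84 to U+00C4 mapping concrete, here is a minimal C++ sketch of decoding a two-byte UTF-8 sequence. The function name decode_utf8_2byte is made up for illustration and is not part of Xalan-C++, Xerces-C++ or gSOAP; it only handles the two-byte form (110xxxxx 10xxxxxx).

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only.
// Decodes a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point.
// Returns U+FFFD (the replacement character) if the bytes are not a valid
// two-byte sequence.
std::uint32_t decode_utf8_2byte(unsigned char lead, unsigned char trail) {
    // The lead byte must match 110xxxxx and the trail byte 10xxxxxx.
    if ((lead & 0xE0) != 0xC0 || (trail & 0xC0) != 0x80) {
        return 0xFFFD;
    }
    // Five payload bits from the lead byte, six from the trail byte.
    std::uint32_t cp = ((lead & 0x1Fu) << 6) | (trail & 0x3Fu);
    // Values below U+0080 must use the one-byte form (overlong otherwise).
    if (cp < 0x80) {
        return 0xFFFD;
    }
    return cp;
}
```

Calling decode_utf8_2byte(0xC3, 0x84) gives 0xC4, i.e. U+00C4.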
> > However, if I return the same character created from the Xerces-C++ DOM,
> > this character is encoded as Ä.
>
> What do you mean by "if i return the same character created from the
> Xerces-C++ DOM?" How did you create this instance? Did you parse it? If
> not, that DOM instance probably isn't relevant to the discussion. Do you
> mean you are serializing an instance of the DOM, and you are getting those
> two characters? If that's the case, you have an encoding problem, because,
> in UTF-16, you are getting U+00C3 (Latin Capital Letter A With Tilde) and
> U+0084 (decimal 132), which is a control character.
>
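As a rough illustration of how one character turns into two, the sketch below widens each byte of a UTF-8 string to its own code point, which is what a reader does if it assumes one byte per character (Latin-1 style). The name widen_bytes_as_latin1 is hypothetical; this is not what Xerces-C++ or gSOAP actually run, only a model of the failure mode. Note that decimal 132 is hex 84, so the second value is U+0084.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only.
// Treats every byte as a separate character, ignoring UTF-8 multi-byte
// structure. This is the mistake that splits one character into two.
std::vector<std::uint32_t> widen_bytes_as_latin1(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    for (unsigned char b : bytes) {
        out.push_back(b);  // byte value becomes the code point, unchanged
    }
    return out;
}
```

For the input bytes C3 84 this returns two code points, 0xC3 and 0x84, instead of the single U+00C4.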
> My understanding of VB, which is extremely limited, is that strings are
> encoded in UCS-2, not UTF-8. You may have a problem with parsing a
> document which contains an encoding declaration asserting the document is
> in UTF-8, when it really is UCS-2.
>
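To show why a UTF-8 declaration on UCS-2 data would break parsing, here is a small C++ sketch that encodes the same code point both ways. Both function names are invented for this example, and the little-endian byte order for UCS-2 is an assumption (typical on Windows), not something stated in this thread.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode a BMP code point (U+0000..U+FFFF) as UTF-8.
// Surrogate values are not checked in this sketch.
std::string encode_utf8_bmp(std::uint16_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                        // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));          // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));        // 10xxxxxx
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));         // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F)); // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));        // 10xxxxxx
    }
    return out;
}

// Hypothetical helper: encode the same code point as one UCS-2 code unit,
// assuming little-endian byte order (low byte first).
std::string encode_ucs2_le(std::uint16_t cp) {
    std::string out;
    out += static_cast<char>(cp & 0xFF);
    out += static_cast<char>(cp >> 8);
    return out;
}
```

For U+00C4, encode_utf8_bmp produces the bytes C3 84, while encode_ucs2_le produces C4 00. A parser told to expect UTF-8 would reject the second form, since C4 is a lead byte that must be followed by a continuation byte and 00 is not one.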
> Dave
>
>