Thanks for replying. I managed to work this out: it seems the gSOAP framework was converting the 8-bit UTF-8 characters into 7-bit strings (i.e. Ä), which caused Xalan to fail to load the document. The solution was to set the following in gSOAP:

    soap_init2(&soap, SOAP_C_UTFSTRING, SOAP_C_UTFSTRING);

This stops gSOAP from altering 8-bit characters.

Cheers,
Paul

----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <xalan-c-users@xml.apache.org>
Sent: Tuesday, June 17, 2003 4:46 PM
Subject: Re: Xalan-C++ and UTF-8 with non ascii characters

> > hi,
> > i am using xalan-c++ to perform XPath queries on an XML document. All
> > works fine except some non-ASCII characters, when encoded as UTF-8, cause
> > an exception in theliaison->parseXMLStream();
>
> I suggest you catch the exception and take a look at the error message.
> Without that, it will be impossible to diagnose the problem. Start with
> catching SAXParseException, because that's probably what's being thrown.
>
> > An example problematic character is the German umlaut. The XML is
> > transported over http/SOAP from a VB application to Xalan-C++ using
> > gSOAP. Looking at the encoding of the umlaut character shows it is sent
> > from VB as two bytes (hex) C3 84 - (decimal) 195 132
>
> The two bytes C3 84 in UTF-8 encode the Unicode character U+00C4, Latin
> Capital Letter A With Diaeresis, or capital A with an umlaut. Is that the
> character you're expecting?
>
> > however if i return the same character created from the Xerces-C++ DOM
> > this character is encoded as Ä.
>
> What do you mean by "if i return the same character created from the
> Xerces-C++ DOM?" How did you create this instance? Did you parse it? If
> not, that DOM instance probably isn't relevant to the discussion. Do you
> mean you are serializing an instance of the DOM, and you are getting those
> two characters? If that's the case, you have an encoding problem, because,
> in UTF-16, you are getting U+00C3 (Latin Capital Letter A With Tilde) and
> U+0084 (decimal 132), which is a control character.
>
> My understanding of VB, which is extremely limited, is that strings are
> encoded in UCS-2, not UTF-8. You may have a problem with parsing a
> document which contains an encoding declaration asserting the document is
> in UTF-8, when it really is UCS-2.
>
> Dave
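
P.S. For anyone checking the byte values discussed in the quoted thread, here is a minimal sketch of the arithmetic in plain C++. It is not taken from gSOAP, Xalan-C++ or Xerces-C++; the helper names encodeUtf8 and misreadAsLatin1 are made up for illustration. It only shows how U+00C4 becomes the two bytes C3 84 in UTF-8, and what code points you get if those bytes are wrongly treated as one character each.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): encode one Unicode code point
// as UTF-8 bytes. Surrogate code points are not rejected in this sketch.
std::string encodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);  // outside the Unicode range
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper (illustration only): treat every byte as a separate
// ISO-8859-1 character, which is the kind of misreading that turns one
// UTF-8 character into two unrelated code points.
std::vector<std::uint32_t> misreadAsLatin1(const std::string& bytes)
{
    std::vector<std::uint32_t> codePoints;
    for (unsigned char b : bytes) {
        codePoints.push_back(b);
    }
    return codePoints;
}
```

Calling encodeUtf8(0x00C4) gives the two bytes C3 84 (decimal 195 132), and passing that result to misreadAsLatin1 gives the code points U+00C3 and U+0084, which matches the garbled output described above.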