Thanks for replying.
    I managed to work this out. It seems the gSOAP framework was converting
the 8-bit UTF-8 bytes into 7-bit character references, i.e. Ã&#132, which
caused Xalan to fail to load the document. The solution was to set the
following in gSOAP:

soap_init2(&soap, SOAP_C_UTFSTRING,SOAP_C_UTFSTRING);

This stops gSOAP from altering 8-bit characters.
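
For anyone hitting the same thing, here is a rough sketch of why the per-byte conversion breaks the document. It is not gSOAP code: the function names are made up for illustration, and the decoder only handles 1- and 2-byte UTF-8 sequences. It shows that escaping the bytes C3 84 one at a time gives two separate character references, while decoding them together as UTF-8 gives the single code point U+00C4.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode one UTF-8 sequence starting at s[i] and advance i.
// Sketch only: handles 1- and 2-byte sequences; anything else is returned
// as the raw byte value.
std::uint32_t decode_utf8_short(const std::string& s, std::size_t& i) {
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if ((b0 & 0xE0u) == 0xC0u && i < s.size()) {
        // Two-byte form: 110xxxxx 10yyyyyy -> xxxxxyyyyyy
        unsigned char b1 = static_cast<unsigned char>(s[i++]);
        return ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
    }
    return b0;
}

// Hypothetical: escape every byte >= 0x80 on its own, which is the kind of
// 8-bit to 7-bit conversion described above.
std::string escape_bytewise(const std::string& s) {
    std::string out;
    for (unsigned char b : s) {
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += "&#" + std::to_string(static_cast<unsigned long>(b)) + ";";
        }
    }
    return out;
}

// Hypothetical: decode UTF-8 first, then escape whole code points.
std::string escape_codepoints(const std::string& s) {
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        std::uint32_t cp = decode_utf8_short(s, i);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            out += "&#" + std::to_string(static_cast<unsigned long>(cp)) + ";";
        }
    }
    return out;
}
```

Calling escape_bytewise("\xC3\x84") yields the two references 195 and 132, whereas escape_codepoints("\xC3\x84") yields the single reference 196 (U+00C4), which is what an XML parser would need to see if the character is escaped at all.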

Cheers
Paul


----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <xalan-c-users@xml.apache.org>
Sent: Tuesday, June 17, 2003 4:46 PM
Subject: Re: Xalan-C++ and UTF-8 with non ascii characters


>
>
>
>
> > hi,
> >     i am using xalan-c++ to perform XPath queries on an XML document. All
> > works fine except some non-ASCII characters when encoded as UTF-8 cause an
> > exception in theliaison->parseXMLStream();
>
> I suggest you catch the exception and take a look at the error message.
> Without that, it will be impossible to diagnose the problem.  Start with
> catching SAXParseException, because that's probably what's being thrown.
>
> > An example problematic character is the German umlaut. The XML is
> > transported over HTTP/SOAP from a VB application to Xalan-C++ using gSOAP.
> > Looking at the encoding of the umlaut character shows it is sent from VB
> > as two bytes:
> > (hex) C3 84  - (decimal) 195 132
>
> The two bytes C3 84 in UTF-8 encode the Unicode character U+00C4, Latin
> Capital Letter A With Diaeresis, or capital A with an umlaut.  Is that the
> character you're expecting?
>
> > however if i return the same character created from the Xerces-C++ DOM
> > this character is encoded as &#195;&#132.
>
> What do you mean by "if i return the same character created from the
> Xerces-C++ DOM?"  How did you create this instance?  Did you parse it?  If
> not, that DOM instance probably isn't relevant to the discussion.  Do you
> mean you are serializing an instance of the DOM, and you are getting those
> two characters?  If that's the case, you have an encoding problem, because,
> in UTF-16, you are getting U+00C3 (Latin Capital Letter A With Tilde) and
> U+0084, which is a control character.
>
> My understanding of VB, which is extremely limited, is that strings are
> encoded in UCS-2, not UTF-8.  You may have a problem with parsing a
> document which contains an encoding declaration asserting the document is
> in UTF-8, when it really is UCS-2.
>
> Dave
>
>
