How did you initialize the Xerces-C library?
I had the same problem before I set up my current locale explicitly. Here is my code:
"""
setlocale (LC_ALL,"de_DE");

// To avoid arithmetic problems (the iso-8859-1 locale uses a comma
// as the decimal separator), the handling of numbers is reset to "C".
//
// For Xalan-C in particular this setting is currently mandatory.
setlocale (LC_NUMERIC,"C");

XERCESC_NS XMLPlatformUtils::Initialize ();
"""
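To illustrate why LC_NUMERIC is reset to "C" after LC_ALL, here is a minimal sketch that uses only the standard library (no Xerces involved). The helper names `setup_locale` and `format_decimal` are made up for this example, and whether "de_DE" is installed under that name depends on the system, so the sketch reports that instead of assuming it.

```cpp
#include <cassert>
#include <clocale>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a double with "%.1f" under the
// C locale that is active at the time of the call.
std::string format_decimal(double x) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.1f", x);
    return std::string(buf);
}

// Hypothetical helper: applies the two setlocale calls from the snippet above.
// Returns false if the named locale is not available on this system
// (in that case the previous locale stays in effect).
bool setup_locale(const char* name) {
    const bool available = std::setlocale(LC_ALL, name) != nullptr;

    // Reset only the numeric category; character handling keeps
    // whatever LC_ALL selected. The "C" locale always exists.
    const char* numeric = std::setlocale(LC_NUMERIC, "C");
    assert(numeric != nullptr);
    (void) numeric;

    return available;
}
```

Calling `setup_locale("de_DE")` and then `format_decimal(1.5)` should give a decimal point rather than a comma, whether or not the German locale could be loaded.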
Maybe this helps. Works for me within the following (small) example:
"""
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
using namespace xercesc;

#include <locale.h>

#include <iostream>
using namespace std;

int main(int argc, char* argv[])
{
    // try this
    setlocale(LC_ALL,"de_DE");
    setlocale(LC_NUMERIC,"C");

    xercesc::XMLPlatformUtils::Initialize();

    // "Hallo", plain ASCII
    XMLCh nameWithoutUmlaut[] = {
        (XMLCh) 'H',(XMLCh) 'a',(XMLCh) 'l',(XMLCh) 'l',(XMLCh) 'o',0
    };
    // "Hallöle"; 0xF6 is the code point of the o-umlaut
    XMLCh nameWithUmlaut[] = {
        (XMLCh) 'H',(XMLCh) 'a',(XMLCh) 'l',(XMLCh) 'l',0xF6,(XMLCh) 'l',(XMLCh) 'e',0
    };

    // transcode to the local code page and print
    char * value = xercesc::XMLString::transcode(nameWithoutUmlaut);
    cout << value << endl;
    delete[] value;

    value = xercesc::XMLString::transcode(nameWithUmlaut);
    cout << value << endl;
    delete[] value;

    xercesc::XMLPlatformUtils::Terminate();
    return 0;
}
"""

Philip Gross wrote:
I am parsing an XML file encoded in UTF-8:
<?xml version='1.0' encoding='UTF-8' ?>
<TEXT>
<SENTENCE Type=".">
<WORD Orth="Hallöle" PInt="0" PMode=""></WORD>
<WORD Orth="Halloele" PInt="0" PMode=""></WORD>
<WORD Orth="Welt" PInt="5" PMode="."></WORD>
</SENTENCE>
</TEXT>
Hallöle in line 4 is the UTF-8 encoding of Hallöle (with the German umlaut ö).
Now I want to get the value of attribute 'Orth' in the first element 'WORD'.
The XMLCh uni_value returned from getAttribute unfortunately does not
contain Unicode, but characters in the local codepage, as I found out by
looking at the memory (last line in following code):
string attr = "Orth";
XMLCh * uni_attr = xercesc::XMLString::transcode(attr.c_str());
const XMLCh * uni_value = element->getAttribute(uni_attr);
char * value = xercesc::XMLString::transcode(uni_value);
cout << value << endl;
for(unsigned i = 0; i < xercesc::XMLString::stringLen(uni_value); i++) cout << i << (char)uni_value[i];
The cout shows nothing at all, though the umlaut ö should be transcoded
properly because it is contained in the local codepage (ISO8859-1).
I read that XMLCh corresponds to UTF-16, but the memory dump shows "Hallöle"
in the local codepage.
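
A note on this observation: ISO-8859-1 coincides with the first 256 Unicode code points, so for ö a UTF-16 code unit and the ISO-8859-1 byte carry the same numeric value (0xF6), while UTF-8 needs the two bytes 0xC3 0xB6. A dump that looks like the local codepage is therefore consistent with UTF-16 content. The following is only a small standard-library sketch of that arithmetic; the function names `to_utf8` and `to_utf16_unit` are made up for this illustration and are not Xerces-C calls.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes a code point below U+0800 as UTF-8
// (one byte for ASCII, two bytes otherwise).
std::string to_utf8(unsigned cp) {
    assert(cp < 0x800);
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));    // lead byte: 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));  // trail byte: 10xxxxxx
    }
    return out;
}

// Hypothetical helper: for code points in the Basic Multilingual Plane
// (surrogates excluded) the single UTF-16 code unit equals the code point.
unsigned short to_utf16_unit(unsigned cp) {
    assert(cp < 0xD800 || (cp >= 0xE000 && cp <= 0xFFFF));
    return static_cast<unsigned short>(cp);
}
```

For U+00F6, `to_utf8` yields the two bytes that show up as "Ã¶" when viewed as ISO-8859-1, and `to_utf16_unit` yields 0x00F6, the same value as the ISO-8859-1 byte for ö.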
parser->getDocument()->getActualEncoding() also returns 'UTF-8', so the
document is recognized correctly.
I don't see what the error is; maybe someone does or has had the same
problem...
Thanks in advance,
Philip Gross
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
-- holger floerke d o c t r o n i c email [EMAIL PROTECTED] information publishing + retrieval phone +49 2222 9292 90 http://www.doctronic.de
