I am parsing an XML file encoded in UTF-8:
<?xml version='1.0' encoding='UTF-8' ?>
<TEXT>
  <SENTENCE Type=".">
    <WORD Orth="Hallöle" PInt="0" PMode=""></WORD>
    <WORD Orth="Halloele" PInt="0" PMode=""></WORD>
    <WORD Orth="Welt" PInt="5" PMode="."></WORD>
  </SENTENCE>
</TEXT>
Hallöle in line 4 is the UTF-8 encoding of Hallöle (with the German umlaut ö). Now I want to get the value of the attribute 'Orth' of the first 'WORD' element. Unfortunately, the XMLCh string uni_value returned by getAttribute does not seem to contain Unicode, but characters in the local code page, as I found out by looking at the memory (last line of the following code):
string attr = "Orth";
XMLCh * uni_attr = xercesc::XMLString::transcode(attr.c_str());
const XMLCh * uni_value = element->getAttribute(uni_attr);
char * value = xercesc::XMLString::transcode(uni_value);
cout << value << endl;
for(unsigned i = 0; i < xercesc::XMLString::stringLen(uni_value); i++) cout << i << (char)uni_value[i];
The cout shows nothing at all, although the umlaut ö should be transcoded properly because it is contained in the local code page (ISO 8859-1). I read that XMLCh corresponds to UTF-16, but the memory dump shows "Hallöle" in the local code page. parser->getDocument()->getActualEncoding() also returns 'UTF-8', so the document's encoding is recognized correctly. I don't see what the error is; maybe someone does, or has had the same problem...
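A note on the memory check: casting each XMLCh to char keeps only the low byte, and U+00F6 (ö) has the same numeric value, 0xF6, in UTF-16 and in ISO 8859-1, so that cast alone cannot tell the two apart. Printing each unit as a number might be more telling. The following is only a minimal sketch under stated assumptions: std::u16string stands in for the XMLCh buffer (XMLCh is assumed to be a 16-bit unit here), and hexDump and narrowToBytes are hypothetical helper names, not part of Xerces.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format each UTF-16 code unit as a 4-digit hex number,
// separated by single spaces. std::u16string is a stand-in for an XMLCh buffer.
std::string hexDump(const std::u16string& s) {
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) out << ' ';
        out << std::setw(4) << static_cast<unsigned>(s[i]);
    }
    return out.str();
}

// Hypothetical helper: narrow each code unit to its low byte, which is what
// a cast such as (char)uni_value[i] effectively does.
std::string narrowToBytes(const std::u16string& s) {
    std::string bytes;
    for (char16_t unit : s) {
        bytes.push_back(static_cast<char>(unit & 0xFF));
    }
    return bytes;
}
```

For the UTF-16 string "Hallöle", hexDump yields 00F6 at the position of the ö, while narrowToBytes yields the single byte 0xF6 there, which is exactly the ISO 8859-1 byte for ö.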
Thanx in advance,
Philip Gross
