I am parsing an XML file encoded in UTF-8:

<?xml version='1.0' encoding='UTF-8' ?>
<TEXT>
 <SENTENCE Type=".">
  <WORD Orth="Hallöle" PInt="0" PMode=""></WORD>
  <WORD Orth="Halloele" PInt="0" PMode=""></WORD>
  <WORD Orth="Welt" PInt="5" PMode="."></WORD>
 </SENTENCE>
</TEXT>

The word "Hallöle" in line 4 of the file is stored in UTF-8 (ö is the German o-umlaut).
Now I want to get the value of attribute 'Orth' in the first element 'WORD'.
The XMLCh uni_value returned from getAttribute unfortunately does not
contain Unicode, but characters in the local codepage, as I found out by
looking at the memory (last line in following code):

string attr = "Orth";
XMLCh * uni_attr = xercesc::XMLString::transcode(attr.c_str());
const XMLCh * uni_value = element->getAttribute(uni_attr);
char * value = xercesc::XMLString::transcode(uni_value);
cout << value << endl;
for(unsigned i = 0; i < xercesc::XMLString::stringLen(uni_value); i++) cout << i << (char)uni_value[i];


The cout shows nothing at all, though the umlaut ö should be transcoded
properly because it is contained in the local codepage (ISO8859-1).
I read that XMLCh corresponds to UTF-16, but the memory dump shows "Hallöle"
in the local codepage.
parser->getDocument()->getActualEncoding() also returns 'UTF-8', so the
document is recognized correctly.
I don't see what the error is; maybe someone does, or has had the same
problem...
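
One thing that may be worth checking, offered as an assumption rather than a
confirmed diagnosis: for characters in the Latin-1 range, a UTF-16 code unit
has the same numeric value as the ISO8859-1 byte (ö is U+00F6 in UTF-16 and
0xF6 in ISO8859-1, whereas UTF-8 stores it as the two bytes C3 B6). A dump that
casts each XMLCh to char could therefore look like the local codepage even if
the buffer really holds UTF-16. The sketch below does not use Xerces at all;
char16_t is a stand-in for XMLCh, and kOrth, dumpCodeUnits and truncateToBytes
are hypothetical names made up for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a Xerces XMLCh buffer: UTF-16 code units.
// The umlaut is written as an escape so the source file encoding does not matter.
static const char16_t kOrth[] = u"Hall\u00F6le";

// Format each 16-bit code unit as four hex digits, separated by spaces.
std::string dumpCodeUnits(const char16_t* s) {
    std::string out;
    char buf[8];
    for (std::size_t i = 0; s[i] != 0; ++i) {
        std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(s[i]));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// Mimic the (char) cast used in the loop above: keep only the low byte
// of every code unit.
std::string truncateToBytes(const char16_t* s) {
    std::string out;
    for (std::size_t i = 0; s[i] != 0; ++i)
        out += static_cast<char>(s[i] & 0xFF);
    return out;
}
```

Printing dumpCodeUnits(kOrth) lists the full 16-bit value of every character,
which makes it easier to tell UTF-16 from a single-byte codepage than a
character-by-character cast does. truncateToBytes(kOrth) yields exactly the
ISO8859-1 spelling of the word, which is why the two can be confused in a
memory view. Whether this has anything to do with the empty cout output is a
separate question that this sketch does not answer.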

Thanx in advance,

Philip Gross
