Now transcode works fine and I'm getting back a string in the local codepage.
But one problem remains:
I need multilingual support in my program (i.e. UTF-8), but with setlocale(LC_ALL,"de_DE"), element->getAttribute(uni_attr) returns an XMLCh string which already seems to contain ISO-8859-1 characters.
What I need are UTF-8 characters, as my program is supposed to use those internally (for database queries, etc.).
I tried setlocale(LC_ALL,"de_DE.UTF-8"), but that leads me back to my original problems.
Is there a way to get the value out of a UTF-8 encoded document as UTF-8 characters?
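One locale-independent way to get UTF-8 out of a UTF-16 buffer is to encode it by hand. The following is only a sketch under assumptions: `Utf16Unit` stands in for XMLCh (assumed to be a 16-bit UTF-16 code unit), and `utf16ToUtf8` is a hypothetical helper, not a Xerces-C function. Xerces-C has its own transcoding service, which would probably be the more natural choice in a real program.

```cpp
#include <cstddef>
#include <string>

// Stand-in for Xerces' XMLCh (assumption: one 16-bit UTF-16 code unit).
typedef char16_t Utf16Unit;

// Hypothetical helper, not part of Xerces-C: encodes a zero-terminated
// UTF-16 string as UTF-8 without consulting the C locale.
std::string utf16ToUtf8(const Utf16Unit* s)
{
    std::string out;
    for (std::size_t i = 0; s[i] != 0; ++i) {
        char32_t cp = s[i];
        if (cp >= 0xD800 && cp <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            // High surrogate followed by low surrogate: combine them.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            // Unpaired surrogate: emit U+FFFD (replacement character).
            cp = 0xFFFD;
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Because nothing here depends on setlocale, the result should be the same whether the process runs under "de_DE" or "C"; the code unit 0x00F6 (ö) becomes the two bytes 0xC3 0xB6.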
Thanx, Phil
Holger Flörke wrote:
How did you initialize the Xerces-C library?
I had the same problem before I set up my current locale explicitly. Here is my code:
"""
setlocale (LC_ALL,"de_DE");

// To avoid numeric problems (ISO-8859-1 / de_DE uses a comma as the
// decimal separator), number handling is reset to "C".
//
// This setting is currently mandatory, especially for Xalan-C.
setlocale (LC_NUMERIC,"C");

XERCESC_NS XMLPlatformUtils::Initialize ();
"""
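The effect of resetting LC_NUMERIC can be checked in isolation. This is a minimal sketch using only the C standard library; `formatWithCNumeric` is a hypothetical name chosen for illustration, and the "de_DE" locale is assumed to be installed (if it is not, setlocale returns NULL and the previous locale stays active).

```cpp
#include <clocale>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a number after applying the same two
// setlocale calls as above.
std::string formatWithCNumeric(double value)
{
    // May fail (return NULL) if de_DE is not installed; the locale is
    // then left unchanged.
    std::setlocale(LC_ALL, "de_DE");
    // Number formatting goes back to "C", so the decimal separator is '.'
    // regardless of what LC_ALL selected.
    std::setlocale(LC_NUMERIC, "C");

    char buf[64];
    std::snprintf(buf, sizeof buf, "%.2f", value);
    return buf;
}
```

Without the second call, a de_DE locale would format the same value with a comma, which is the kind of numeric problem the comment above refers to.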
Maybe this helps. It works for me in the following (small) example:
"""
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
using namespace xercesc;

#include <locale.h>

#include <iostream>
using namespace std;

int main(int argc, char* argv[])
{
// try this
setlocale(LC_ALL,"de_DE");
setlocale(LC_NUMERIC,"C");

xercesc_2_4::XMLPlatformUtils::Initialize();
XMLCh nameWithoutUmlaut[] = { (XMLCh) 'H',(XMLCh) 'a',(XMLCh) 'l',(XMLCh) 'l',(XMLCh) 'o',0 };
XMLCh nameWithUmlaut[] = {
(XMLCh) 'H',(XMLCh) 'a',(XMLCh) 'l',(XMLCh) 'l',0xF6,(XMLCh) 'l',(XMLCh) 'e',0
};
char * value = xercesc::XMLString::transcode(nameWithoutUmlaut);
cout << value << endl;
delete[] value;
value = xercesc::XMLString::transcode(nameWithUmlaut);
cout << value << endl;
delete[] value;
XMLPlatformUtils::Terminate();
return 0;
}
"""
Philip Gross wrote:
I am parsing an XML file encoded in UTF-8:

<?xml version='1.0' encoding='UTF-8' ?>
<TEXT>
  <SENTENCE Type=".">
    <WORD Orth="Hallöle" PInt="0" PMode=""></WORD>
    <WORD Orth="Halloele" PInt="0" PMode=""></WORD>
    <WORD Orth="Welt" PInt="5" PMode="."></WORD>
  </SENTENCE>
</TEXT>

Hallöle in line 4 is the UTF-8 encoding of Hallöle (with the German umlaut ö).
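Why the attribute shows up as "Hallöle" can be reproduced with a few lines: the two UTF-8 bytes of ö (0xC3 0xB6) are each read as a separate ISO-8859-1 character. The sketch below assumes nothing about Xerces; `latin1ToUtf8` is a hypothetical helper that treats every input byte as an ISO-8859-1 character and re-encodes it as UTF-8.

```cpp
#include <string>

// Hypothetical helper: interprets each byte as one ISO-8859-1 character
// (code points U+0000..U+00FF) and encodes the result as UTF-8.
std::string latin1ToUtf8(const std::string& bytes)
{
    std::string out;
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b < 0x80) {
            // ASCII is identical in both encodings.
            out += static_cast<char>(b);
        } else {
            // U+0080..U+00FF need two bytes in UTF-8.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes of "Hallöle" yields the text "Hallöle" (Ã is 0xC3, ¶ is 0xB6), while feeding it the single ISO-8859-1 byte 0xF6 yields a correctly encoded ö.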
Now I want to get the value of attribute 'Orth' in the first element 'WORD'.
The XMLCh uni_value returned from getAttribute unfortunately does not
contain Unicode, but characters in the local codepage, as I found out by
looking at the memory (last line in the following code):
string attr = "Orth";
XMLCh * uni_attr = xercesc::XMLString::transcode(attr.c_str());
const XMLCh * uni_value = element->getAttribute(uni_attr);
char * value = xercesc::XMLString::transcode(uni_value);
cout << value << endl;
for(unsigned i = 0; i < xercesc::XMLString::stringLen(uni_value); i++) cout << i << (char)uni_value[i];
The cout shows nothing at all, though the umlaut ö should be transcoded
properly because it is contained in the local codepage (ISO-8859-1).
I read that XMLCh corresponds to UTF-16, but the memory dump shows "Hallöle"
in the local codepage.
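A dump that looks like ISO-8859-1 does not rule out UTF-16: the first 256 Unicode code points coincide with ISO-8859-1, so ö is the code unit 0x00F6 in UTF-16, and casting it to char leaves the byte 0xF6. Printing the code units as hex makes this easier to see than a cast to char. The sketch below assumes a 16-bit code unit (char16_t standing in for XMLCh); `dumpCodeUnits` is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: renders each 16-bit code unit of a zero-terminated
// UTF-16 string as four hex digits, separated by spaces.
std::string dumpCodeUnits(const char16_t* s)
{
    std::string out;
    char buf[8];
    for (std::size_t i = 0; s[i] != 0; ++i) {
        std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(s[i]));
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For "Hallöle" this gives 0048 0061 006C 006C 00F6 006C 0065. A character outside ISO-8859-1, such as the euro sign (20AC), would show that the buffer really holds 16-bit units.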
parser->getDocument()->getActualEncoding() also returns 'UTF-8', so the
document is recognized correctly.
I don't see what the error is; maybe someone does or has had the same
problem...
Thanx in advance,
Philip Gross
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
