Thank you for your quick answer, this idea solved half of my problem.
Now transcode works fine and I'm getting back a string in the local codepage.


But one problem remains:
I need multilingual support in my program (-> UTF-8), but with setlocale(LC_ALL,"de_DE"), element->getAttribute(uni_attr) returns an XMLCh string that already seems to contain ISO-8859-1 characters.
What I need are UTF-8 characters, as my program is supposed to use those internally (for database queries, etc.).
I tried setlocale(LC_ALL,"de_DE.UTF-8"), but that leads me back to my original problems.


Is there a way to get the value out of a UTF-8 encoded document in UTF-8 characters?
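
For reference, the conversion asked about here does not have to go through the locale at all, since XMLCh strings are UTF-16 and UTF-16 can be re-encoded as UTF-8 directly. A minimal sketch, assuming `char16_t` as a stand-in for XMLCh so that it compiles without Xerces; `Utf16Unit` and `utf16ToUtf8` are hypothetical names, not part of the library:

```cpp
#include <cstddef>
#include <string>

// Stand-in for Xerces' XMLCh (a 16-bit UTF-16 code unit). This typedef is
// an assumption made so the sketch builds without the library.
typedef char16_t Utf16Unit;

// Convert a zero-terminated UTF-16 string to a UTF-8 std::string.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16ToUtf8(const Utf16Unit* s)
{
    std::string out;
    for (std::size_t i = 0; s[i] != 0; ++i) {
        unsigned long cp = s[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            unsigned long lo = s[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high one
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Xerces-C also ships its own transcoders (reachable through XMLPlatformUtils::fgTransService and makeNewTranscoderFor), which may be the better route in real code; the exact signatures depend on the library version.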

Thanks,
Phil


Holger Flörke wrote:

How did you initialize the Xerces-C library?

I had the same problem before I set up my current locale explicitly. Here is my code:

"""
  setlocale (LC_ALL,"de_DE");

  // To avoid arithmetic problems, because iso-8859-1 (the de_DE locale)
  //  uses a comma as the decimal separator,
  //  number handling is reset to "C".
  //
  // For Xalan-C in particular, this setting is currently mandatory.
  setlocale (LC_NUMERIC,"C");

  XERCESC_NS XMLPlatformUtils::Initialize ();
"""
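
One caveat with this approach: setlocale returns a null pointer when the requested locale is not installed, and in that case the previous locale silently stays in effect. A minimal sketch of the same two calls with that check added; `initLocale` and `formatNumber` are hypothetical helper names used only for illustration:

```cpp
#include <clocale>
#include <cstdio>
#include <string>

// Hypothetical helper: select the named locale for everything except
// numbers. Returns false if the requested locale is not installed.
bool initLocale(const char* name)
{
    bool ok = std::setlocale(LC_ALL, name) != 0;
    // Keep "." as the decimal separator regardless of the outcome above.
    std::setlocale(LC_NUMERIC, "C");
    return ok;
}

// Format a double using the current LC_NUMERIC setting.
std::string formatNumber(double v)
{
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.2f", v);
    return buf;
}
```

Calling initLocale("de_DE") before XMLPlatformUtils::Initialize() and checking the result shows right away whether the locale actually took effect.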

Maybe this helps. Works for me within the following (small) example:

"""
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
using namespace xercesc;

#include <locale.h>

#include <iostream>
using namespace std;

int main(int argc, char* argv[])
{
  // try this
  setlocale(LC_ALL,"de_DE");
  setlocale(LC_NUMERIC,"C");

  xercesc_2_4::XMLPlatformUtils::Initialize();

  XMLCh nameWithoutUmlaut[] = {
    (XMLCh) 'H',(XMLCh) 'a',(XMLCh) 'l',(XMLCh) 'l',(XMLCh) 'o',0
  };

  XMLCh nameWithUmlaut[] = {
    (XMLCh) 'H',(XMLCh) 'a',(XMLCh) 'l',(XMLCh) 'l',0xF6,(XMLCh) 'l',(XMLCh) 'e',0
  };


  char * value = xercesc::XMLString::transcode(nameWithoutUmlaut);

  cout << value << endl;

  xercesc::XMLString::release(&value);  // release transcoded strings through Xerces

  value = xercesc::XMLString::transcode(nameWithUmlaut);

  cout << value << endl;

  xercesc::XMLString::release(&value);  // release transcoded strings through Xerces

  XMLPlatformUtils::Terminate();

  return 0;
}
"""

Philip Gross wrote:

I am parsing an XML file encoded in UTF-8:

<?xml version='1.0' encoding='UTF-8' ?>
<TEXT>
 <SENTENCE Type=".">
  <WORD Orth="Hallöle" PInt="0" PMode=""></WORD>
  <WORD Orth="Halloele" PInt="0" PMode=""></WORD>
  <WORD Orth="Welt" PInt="5" PMode="."></WORD>
 </SENTENCE>
</TEXT>

Hallöle in line 4 is the UTF-8 encoding of Hallöle (with the German umlaut ö),
as it appears when viewed as ISO-8859-1.
Now I want to get the value of attribute 'Orth' in the first element 'WORD'.
The XMLCh string uni_value returned from getAttribute unfortunately does not
contain Unicode, but characters in the local codepage, as I found out by
looking at the memory (last line of the following code):


string attr = "Orth";
XMLCh * uni_attr = xercesc::XMLString::transcode(attr.c_str());
const XMLCh * uni_value = element->getAttribute(uni_attr);
char * value = xercesc::XMLString::transcode(uni_value);
cout << value << endl;
for(unsigned i = 0; i < xercesc::XMLString::stringLen(uni_value); i++) cout << i << (char)uni_value[i];


The cout shows nothing at all, though the umlaut ö should be transcoded
properly because it is contained in the local codepage (ISO-8859-1).
I read that XMLCh corresponds to UTF-16, but the memory dump shows "Hallöle"
in the local codepage.
parser->getDocument()->getActualEncoding() also returns 'UTF-8', so the
document is recognized correctly.
I don't see what the error is; maybe someone does or has had the same
problem...
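
A note on the memory inspection: the first 256 Unicode code points are identical to ISO-8859-1, so a UTF-16 string that holds only such characters looks like Latin-1 once each unit is cast to char (ö is 0x00F6 in UTF-16 and 0xF6 in ISO-8859-1). Printing the full 16-bit value of each unit makes the difference visible for characters outside that range. A minimal sketch, with `char16_t` standing in for XMLCh as an assumption and `dumpUnits` as a hypothetical helper:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Stand-in for Xerces' XMLCh (assumption, so the sketch builds without it).
typedef char16_t Utf16Unit;

// Render each code unit of a zero-terminated UTF-16 string as a
// four-digit hex value, separated by spaces.
std::string dumpUnits(const Utf16Unit* s)
{
    std::string out;
    char buf[8];
    for (std::size_t i = 0; s[i] != 0; ++i) {
        std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(s[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

A unit such as 20AC (the euro sign) in the dump cannot come from ISO-8859-1, whereas 00F6 is consistent with both readings.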


Thanks in advance,

Philip Gross

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


