Character encoding for each character in Word Document

Doppelhofer Andreas Mon, 18 Jan 2010 03:29:34 -0800

Hi all,
I have a question about getting the character encoding (ASCII, Unicode,
ISO-8859-5, ...) of each character in a Word document.
The following code snippet extracts the text and converts it into a byte
buffer using a "hard coded" charset.
Is there a way to get the correct character encoding dynamically?
Say, the first character "a" is ISO-8859-1, the second is a Russian
character (as in ISO-8859-5), and so on.
 
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("test.doc"));
HWPFDocument mydoc = new HWPFDocument(fs);
Range myrange = mydoc.getRange();

for (int i = 0; i < myrange.numParagraphs(); i++) {
  Paragraph myparagraph = myrange.getParagraph(i);
  String mytext = myparagraph.text();

  Charset charset = Charset.forName("ISO-8859-5");  // "hard coded" :-(
  CharsetEncoder encoder = charset.newEncoder();

  // encode() throws CharacterCodingException for characters the charset cannot map
  ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(mytext));

  // do something with bbuf
}
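
To illustrate what I mean by "dynamically", here is a rough sketch of a
per-character probe. It assumes the text is already available as an ordinary
Java String (UTF-16 internally), so no character carries an encoding of its
own, and the best one can do is ask which charsets are able to represent it.
The class name CharsetProbe, the method firstCharsetFor, the candidate list and
the "UTF-16" fallback are all made up for this example and are not part of POI.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class CharsetProbe {
    // Candidate charsets, tried in this order. The list is an assumption
    // chosen for illustration; adjust it to the encodings you care about.
    static final String[] CANDIDATES = {"US-ASCII", "ISO-8859-1", "ISO-8859-5"};

    /**
     * Returns the name of the first candidate charset that can encode c,
     * or "UTF-16" as a fallback when none of the candidates can.
     */
    static String firstCharsetFor(char c) {
        for (String name : CANDIDATES) {
            // Skip charsets this JVM does not ship (only a few are guaranteed).
            if (!Charset.isSupported(name)) {
                continue;
            }
            CharsetEncoder encoder = Charset.forName(name).newEncoder();
            if (encoder.canEncode(c)) {
                return name;
            }
        }
        return "UTF-16";
    }

    public static void main(String[] args) {
        // Latin 'a' followed by the Cyrillic capital letter ZHE (U+0416).
        String text = "a\u0416";
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Character.UnicodeBlock gives the script range of the character,
            // which may be a more useful answer than a legacy charset name.
            System.out.println(c + " -> " + firstCharsetFor(c)
                    + " (" + Character.UnicodeBlock.of(c) + ")");
        }
    }
}
```

The order of the candidate list decides the answer for characters that several
charsets can encode, so a plain "a" is reported as US-ASCII here rather than
ISO-8859-1.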

Thx dops

-- 


Salomon Automation GmbH - Friesachstrasse 15 - A-8114 Friesach bei Graz
Sitz der Gesellschaft: Friesach bei Graz
UID-NR:ATU28654300 - Firmenbuchnummer: 49324 K
Firmenbuchgericht: Landesgericht fur Zivilrechtssachen Graz

