Thanks for the input, Nick.

But one thing is still not clear: can I encode the text as UTF-8? When I try to extract non-English text such as French or Japanese, the output is incomprehensible. Is there any way to encode non-English text using POI?

Regards,
Som

On Tue, Sep 8, 2009 at 3:18 PM, Nick Burch <[email protected]> wrote:

> On Tue, 8 Sep 2009, Som Satpathy wrote:
>
>> Does Apache POI follow any particular encoding internally while extracting
>> MS Office documents? If so, what is the encoding that POI uses?
>
> POI is written in Java, so uses native Java strings almost everywhere.
> These are Unicode.
>
> The Microsoft file formats generally store text as either US-ASCII or
> UCS-2. The type of the record/block/etc. tells you which it is, so we can
> turn that into Java (Unicode) strings.
>
> Nick
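For anyone following the thread, here is a rough sketch of the two steps described in the quoted reply: 16-bit (UCS-2 style) bytes are decoded into a Java string, and that string can then be written out as UTF-8 by naming the charset explicitly. It uses only the JDK, not any POI API. The class name, method names, and sample bytes are made up for illustration. One possible cause of unreadable output, though this is only an assumption about the setup, is writing the string with the platform default charset or viewing the output in a tool that expects a different encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only; not part of POI.
class Utf8OutputSketch {

    // Decode little-endian 16-bit code units (the UCS-2 style storage
    // mentioned in the quoted reply) into a Java String.
    static String decodeUtf16Le(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    // Encode a Java String as UTF-8 by giving the Writer an explicit charset
    // instead of relying on the platform default.
    static byte[] writeAsUtf8(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Sample data (assumed): U+00E9 and U+65E5 as little-endian code units.
        byte[] raw = {(byte) 0xE9, 0x00, (byte) 0xE5, 0x65};
        String text = decodeUtf16Le(raw);
        byte[] utf8 = writeAsUtf8(text);
        System.out.println("UTF-8 byte count: " + utf8.length);
    }
}
```

In the same spirit, a string returned by an extractor could be passed to an `OutputStreamWriter` wrapped around a `FileOutputStream`, again with `StandardCharsets.UTF_8` named explicitly.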
