On 11/05/09 06:31, David Lu wrote:
I have some Java code that converts Microsoft Word documents
to plain text. It works fine. However, I do have a problem
with some special characters such as the left and right double
quotes, the trademark and copyright symbols, etc., that are
not working as I expect.
Basically, for a Word document, when I use Save As in the GUI
with the Text (encoded) file type and UTF-8 as the encoding, I get
all the special characters in the output file.
However, when I use Java and call storeAsURL() with the same
input file, using Text (encoded) for FilterName and UTF-8
for FilterOptions, some of the characters, namely the trademark
and copyright symbols, and a few others, are saved as question
marks.
I've also tried using Windows-1252/WinLatin 1 as the encoding
with the same results.
The Save As from GUI seems to work better than calling
storeAsURL() in terms of preserving more characters. But the
documentation for storeAsURL() seems to indicate it's the same
as Save As. So do I need to specify additional properties
for storeAsURL()?

This sounds strange, and I suggest you file a bug for it.
Windows-1252 has additional characters compared to ISO 8859-1, in the
range 0x80--0x9F, and at first it sounded like that fact might somehow
be related to the problem. However, trademark (U+2122) is in that range
(at 0x99 in Windows-1252) while copyright (U+00A9) is not (it is 0xA9 in
both encodings), yet you say that both have the problem...
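
The code-page reasoning above can be checked independently of
OpenOffice.org with plain Java. This is only a minimal sketch: the class
and method names (CharsetProbe, hex, canEncode) are hypothetical and not
taken from the original code, and it says nothing about what the export
filter does internally. It just shows which bytes the JDK produces for
these characters in each encoding.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of the original code.
class CharsetProbe {
    // Encode text in the named charset and return the bytes as hex.
    // String.getBytes(Charset) substitutes the charset's replacement
    // byte ('?' = 0x3F for these charsets) for unmappable characters.
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // True if every character of text is representable in the charset.
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String[] names = {"trademark", "copyright", "left double quote"};
        String[] chars = {"\u2122", "\u00A9", "\u201C"};
        String[] charsets = {"ISO-8859-1", "windows-1252", "UTF-8"};
        for (int i = 0; i < chars.length; i++) {
            for (String cs : charsets) {
                System.out.println(names[i] + " in " + cs + ": "
                        + hex(chars[i], cs)
                        + (canEncode(chars[i], cs) ? "" : " (unmappable)"));
            }
        }
    }
}
```

If copyright comes out as 0xA9 in both single-byte encodings and every
character is mappable in UTF-8, as expected, then the choice of encoding
alone would not account for the question marks in the exported file.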
-Stephan