Re: [dev] Save As vs. storeAsURL() text export filter difference

2009-11-09 Thread Stephan Bergmann

On 11/05/09 06:31, David Lu wrote:

I have some Java code that converts Microsoft Word documents
to plain text.  It works fine.  However, I do have a problem
with some special characters such as the left and right double
quotes, the trademark and copyright symbols, etc., that are
not working as I expect.

Basically, for a Word document, when I use Save As in the GUI and
save the document as Text (encoded) with UTF-8, I get all the special
characters in the output file.

However, when I use Java and call storeAsURL() with the same
input file, using Text (encoded) for FilterName and UTF-8
for FilterOptions, some of the characters, namely the trademark
and copyright symbols, and a few others, are saved as question
marks.

I've also tried using Windows-1252/WinLatin 1 as the encoding
with the same results.

The Save As from GUI seems to work better than calling
storeAsURL() in terms of preserving more characters.  But the
documentation for storeAsURL() seems to indicate it's the same
as Save As.  So do I need to specify additional properties
for storeAsURL()?


This sounds strange, and I suggest you file a bug for it.

Windows-1252 has additional characters compared to ISO 8859-1, in the 
range 0x80--0x9F, and at first it sounded like that fact might somehow 
be related to the problem.  However, trademark (U+2122) is in that area 
(0x99) while copyright (U+00A9) is not (0xA9), yet you say that both 
have the problem...
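
The byte values above can be checked with a small standalone Java
sketch. It uses only the JDK's own charset tables, not the OpenOffice
export filter, so it illustrates the mapping argument rather than
reproducing the bug. The class name EncodingCheck and the helper toHex
are made up for this example, and it assumes the JDK in use ships the
windows-1252 charset (common, but not guaranteed by the Java spec).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    /** Encodes text in the given charset and returns the bytes as uppercase hex. */
    public static String toHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // String.getBytes(Charset) substitutes the charset's default
        // replacement byte ('?', 0x3F, for these charsets) for any
        // character it cannot map.
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset utf8 = StandardCharsets.UTF_8;

        // Trademark, copyright, left double quote.
        String[] samples = { "\u2122", "\u00A9", "\u201C" };
        for (String s : samples) {
            System.out.printf("U+%04X  cp1252=%s  latin1=%s  utf8=%s%n",
                    (int) s.charAt(0),
                    toHex(s, cp1252), toHex(s, latin1), toHex(s, utf8));
        }
    }
}
```

A 3F in the output is a question mark, the same symptom described in
the report. Under ISO 8859-1 only the copyright sign survives, under
Windows-1252 and UTF-8 all three do, so neither charset on its own
explains losing both trademark and copyright.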


-Stephan

-
To unsubscribe, e-mail: dev-unsubscr...@openoffice.org
For additional commands, e-mail: dev-h...@openoffice.org



Re: [dev] Save As vs. storeAsURL() text export filter difference

2009-11-05 Thread David Lu


Hi T. J.,

I already tried UTF8 (based on some Google searches) and it
actually performed worse, as it did not translate any non-ASCII
characters at all.  Characters such as the left and right double
quotes were changed to ?.

I stumbled upon UTF-8 after noticing in the GUI that the
character encoding is listed as UTF-8 instead of UTF8.  It
worked better, preserving the left and right double quotes,
but did not preserve the trademark and copyright symbols,
whereas Save As from the GUI preserves all of those
characters.

Based on your suggestion, I also tried various fonts and they
don't seem to make any difference.

I'll give the d...@api.openoffice.org list a try.  Thank you!

  - David -

T. J. Frazier wrote:


Try UTF8 instead of UTF-8.

Working in Basic, I hit a similar problem. After a successful 
load/store, the file itself (I looked with the IDE) had these 
interesting strings, which I copied:


aArgs(2).Name = "FilterName"
aArgs(2).Value = "Text (encoded)"
aArgs(3).Name = "FilterOptions"
aArgs(3).Value = "UTF8,CRLF,Times New Roman,en-US,"

Otherwise, you might have more luck posting your question on the 
d...@api.openoffice.org list.


HTH

