RE: FOP to SVG not including character references in conversion.

2005-01-18 Thread Andreas L. Delmelle
> -----Original Message-----
> From: Dave Austin [mailto:[EMAIL PROTECTED]]
>

Hi,

> I am generating FO files from RTF files with a handy utility called
> RTF2FO.  I set the character encoding to ISO-8859-1 for linux.
>
> I have several character references in the FO file. For instance,
> &#8226; is embedded as the bullet. However, the resultant SVG from the
> FO Processor does not include those references, just ?'s.

The character that Character Reference &#8226; refers to (or, as some like
it better: &#x2022;) is not provided in the ISO-8859-1 charset. By itself,
this doesn't pose a problem --that's precisely what CRs are there for.
The problem most probably lies in the fact that the conversion is from an
ISO-8859-1 encoded FO to an ISO-8859-1 encoded SVG, and that somewhere in
between, the implementation converts the CR to a differently encoded
character...
Hmm... so:
a sequence of ISO-8859-1 characters ('&','#','8','2','2','6',';') is
recognized as a numeric CR, converted to a java.lang.String --always Unicode
characters, so the sequence is converted into '\u2022'-- and, ideally it
should go back again to a sequence of ISO-8859-1 chars representing the CR,
but this doesn't happen.
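
To make that round trip concrete, here is a minimal sketch. It is not FOP
code: the class CharRefRoundTrip and both method names are made up for
illustration, and it only assumes the standard JAXP parser and
java.nio.charset.CharsetEncoder.canEncode(). It parses a fragment containing
the CR, shows that the parser hands back the single character '\u2022', and
then turns every character the target charset cannot encode back into a
numeric CR, which is the step that is missing here.

```java
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only (not part of FOP).
class CharRefRoundTrip {

    /** Parses a small XML fragment and returns the root element's text as a Java String. */
    static String parseText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse fragment", e);
        }
    }

    /**
     * Replaces every character that the named charset cannot encode with a
     * decimal character reference. Markup characters such as '&' and '<'
     * are NOT escaped here; a real serializer has to handle those as well.
     */
    static String escapeUnmappable(String s, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = parseText("<t>&#8226; item</t>");
        // The seven-character reference has become one Unicode character.
        System.out.println(Integer.toHexString(text.charAt(0)));   // prints 2022
        // Going back to something ISO-8859-1 can represent.
        System.out.println(escapeUnmappable(text, "ISO-8859-1"));  // prints &#8226; item
    }
}
```

Characters that ISO-8859-1 can represent pass through untouched, so only the
bullet is rewritten.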

Following the trail:
fop.renderer.svg.SVGRenderer.renderWordArea( area )
  String s = area.getText();
  ...
  Element text = SVGUtilities.createText( ..., s )

fop.svg.SVGUtilities.createText( ..., s )
  org.w3c.dom.Text text = doc.createTextNode(s);

and, interestingly enough, in
fop.renderer.svg.SVGRenderer.startRenderer( OutputStream )
  the ISO-8859-1 encoding is hardcoded.

Not sure why, and up to here it doesn't really make a difference, but in
stopRenderer() the OutputStream *is* referenced, and wrapped inside an
OutputStreamWriter without explicit charset --default platform charset,
which happens to be the same ISO-8859-1 in this case.

So the text goes from ISO-8859-1 to Unicode characters to an OSWriter
outputting bytes according to the default charset of your platform, which
starts off by outputting an XML declaration with an encoding that may (or
may not) indicate otherwise... and, last but not least, if you're not bored
to death yet :-P
The question mark has its roots in java.nio.charset.CharsetEncoder.
(see:
http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/CharsetEncoder.html
)
The Unicode character '\u2022' doesn't have a corresponding ISO-8859-1
character, so the encoder writes its default replacement, '?', instead.
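
If you want to see this for yourself outside of FOP, the following is a small
sketch; the class name QuestionMarkDemo and its encode() method are made up
for illustration. It pushes a string through an OutputStreamWriter with an
explicitly named charset and returns the bytes that come out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical demo class, for illustration only (not part of FOP).
class QuestionMarkDemo {

    /** Writes text through an OutputStreamWriter and returns the resulting bytes. */
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charsetName)) {
            writer.write(text);
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException for an unknown charset name.
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] out = encode("\u2022", "ISO-8859-1");
        // The bullet is unmappable, so the replacement byte is written instead.
        System.out.println(out.length);      // prints 1
        System.out.println((char) out[0]);   // prints ?
    }
}
```

No exception is thrown and nothing is logged: the writer substitutes the
byte 0x3F silently, which is why the problem only shows up in the output.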

OK, now that you have the full story...

If you know your way around in Java, as a quick fix you can change the
hardcoded encoding declaration to "UTF-8" (or remove it), change the
OSWriter's encoding to "UTF-8" as well, then recompile. The output won't
contain the Character Reference, but at least the bytes would be correctly
written and interpreted.
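
Roughly, the idea is the following. This is only a sketch of the principle,
not a patch against SVGRenderer: the class Utf8SvgSketch, the render() method
and the SVG it writes are made up for illustration. The point is that the
encoding named in the XML declaration and the one handed to the
OutputStreamWriter are the same, and that it is one that can represent
'\u2022'.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, for illustration only (not the actual FOP renderer).
class Utf8SvgSketch {

    /**
     * Writes a minimal SVG document containing the given text.
     * The text is assumed to contain no markup characters ('&', '<').
     */
    static byte[] render(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Explicit charset: no dependency on the platform default.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            // The declared encoding matches the writer's charset.
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            writer.write("<svg xmlns=\"http://www.w3.org/2000/svg\">");
            writer.write("<text>" + text + "</text>");
            writer.write("</svg>\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] svg = render("\u2022");
        String decoded = new String(svg, StandardCharsets.UTF_8);
        // The bullet survives as a real character (bytes E2 80 A2 in UTF-8).
        System.out.println(decoded.contains("\u2022"));   // prints true
    }
}
```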

HTH!

Greetz,

Andreas


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: FOP to SVG not including character references in conversion.

2005-01-18 Thread J.Pietschmann
Dave Austin wrote:
> I am generating FO files from RTF files with a handy utility called
> RTF2FO.  I set the character encoding to ISO-8859-1 for linux.
> I have several character references in the FO file. For instance,
> &#8226; is embedded as the bullet. However, the resultant SVG from the
> FO Processor does not include those references, just ?'s.

It's probably similar to this FAQ:
 http://xml.apache.org/fop/faq.html#pdf-characters
with the added twist that RTF was invented before Unicode
and has no capability to use multibyte character sets, or
even to store character set information.
You can't do much about this (except avoiding non-ISO-8859-1
characters).

J.Pietschmann