You are correct on all counts, with possibly one exception.  In my testing,
an xsl:output element with the method attribute set to text and no
encoding attribute emits the two-byte output you describe in paragraph 2
below.
I have seen the Lite.  It's ice cold too.  :-)
thanks much,
Matthew L. Avizinis <mailto:[EMAIL PROTECTED]>


| -----Original Message-----
| From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
| Sent: Monday, July 09, 2001 3:08 PM
| To: Matthew L. Avizinis
| Cc: [EMAIL PROTECTED]
| Subject: RE: [GUMP] Build Failure - Fop
|
|
|
| The small test I set up for this is (Working on the latest Java version):
|
| <?xml version="1.0"?>
| <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
|                 version="1.0">
|   <xsl:output method="xml"/>
|
| <xsl:template match="/"><out>&#0176;</out></xsl:template>
|
| </xsl:stylesheet>
|
| &#0176; is, I think, the degree symbol.  If you set the output to html the
| result is &deg;.  Does this test seem like it replicates your case?  It
| doesn't make a difference for our purposes (at least on the Java version)
| if you set the method to text or use -text.
|
| I get the characters 0xC2 (194) and 0xB0 (176), which I think is the
| correct pair for UTF-8.  I believe this is the same thing you are seeing.
| According to "Unicode, A Primer", by Tony Graham (which I highly
| recommend), a code point from 0x0080 to 0x07FF is represented by the
| Unicode bit pattern 00000yyyyyxxxxxx.  The code is translated to the UTF-8
| bytes 110yyyyy and 10xxxxxx.  So decimal 176, which is represented by the
| bits 00000 00010 110000, means yyyyy=00010 and xxxxxx=110000.  So you get
| 11000010 and 10110000, i.e. 0xC2 and 0xB0.  The fact that byte 2 is 0xB0,
| which equals &#0176;, is either an accident or a UTF-8 design feature...
| I'm not sure which.
|
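
The bit arithmetic quoted above can be checked with a few lines of code.
This is only a minimal sketch in Python (not the Java code under discussion),
and the function name utf8_two_byte is made up for illustration.  It handles
just the two-byte range 0x0080 to 0x07FF described in the quoted paragraph.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in 0x0080..0x07FF as the bytes 110yyyyy 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range 0x0080..0x07FF is handled")
    yyyyy = code_point >> 6      # upper five bits of the 11-bit value
    xxxxxx = code_point & 0x3F   # lower six bits
    return bytes([0xC0 | yyyyy, 0x80 | xxxxxx])


encoded = utf8_two_byte(176)
print(encoded.hex())                         # c2b0
print(encoded == "\u00b0".encode("utf-8"))   # True
```

On the "accident or design feature" question: as far as the arithmetic goes,
the second byte equals the code point only for 0x80 through 0xBF, where the
low byte already starts with the bits 10.  For 0xC0 and up the second byte
differs (for example, 0xE9 becomes 0xC3 0xA9).
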
| If you use <xsl:output method="xml" encoding="US-ASCII"/>, you will get
| <out>&#176;</out>.  If you use
| <xsl:output method="xml" encoding="ISO-8859-1"/>, you will get the single
| character 0xB0, which I think may be what you want.
|
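
The three outcomes described above (a character reference for US-ASCII, one
byte for ISO-8859-1, two bytes for UTF-8) can be reproduced outside the XSLT
processor.  The sketch below uses Python's built-in codecs as a stand-in; it
is an analogy for the serializer's behaviour, not the serializer itself.

```python
degree = "\u00b0"  # U+00B0, decimal 176

# US-ASCII cannot represent U+00B0, so this error handler substitutes a
# numeric character reference, much as the xml output method does.
ascii_bytes = degree.encode("ascii", errors="xmlcharrefreplace")

# ISO-8859-1 maps U+0080..U+00FF directly onto single bytes.
latin1_bytes = degree.encode("iso-8859-1")

# UTF-8 needs two bytes for anything from U+0080 to U+07FF.
utf8_bytes = degree.encode("utf-8")

print(ascii_bytes)   # b'&#176;'
print(latin1_bytes)  # b'\xb0'
print(utf8_bytes)    # b'\xc2\xb0'
```
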
| With <xsl:output method="text" encoding="US-ASCII"/> you will get
| &#176;.  This, I think, should be considered a bug... we should probably
| throw an exception instead, since character entities are meaningless in
| pure text.
|
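
For comparison, here is roughly what "throw an exception instead" looks like
with a strict encoder.  Again this is a Python sketch rather than anything
from the Java code base, and encode_text_strict is a hypothetical helper
name used only for this example.

```python
def encode_text_strict(text: str, encoding: str) -> bytes:
    """Encode plain text, refusing characters the target encoding lacks.

    Hypothetical helper: unlike the xml case, no character reference is
    substituted, because a reference has no meaning in plain text output.
    """
    return text.encode(encoding, errors="strict")


try:
    encode_text_strict("\u00b0", "ascii")
except UnicodeEncodeError as exc:
    # The caller learns that the character cannot be written as US-ASCII.
    print("cannot encode:", exc.reason)
```
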
| -scott
|
