Hi Geir Magnusson Jr.,

Yes, the test works, but that doesn't mean there's no bug.  The bug has nothing to do 
with UTF-8 or any other encoding - decoding is java.io.InputStreamReader's business.  
The bug is in JavaCC's ASCII_CharStream, which masks off the high byte of every 
Unicode character.  In detail:

* Velocity opens a file stream for the template file and wraps it in an 
InputStreamReader using the configured encoding (specified in velocity.properties as 
input.encoding=XXX).

* It then creates an ASCII_CharStream, passing the reader as a parameter.

* It creates a JavaCC Parser, passing the ASCII_CharStream as a parameter.

* The Parser reads characters from the ASCII_CharStream, which in turn reads from the 
InputStreamReader and the FileInputStream (see the sketch below).
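
A rough sketch of that wiring.  The class names are the ones in the Velocity source, 
but the exact constructor signatures are an assumption on my part, so treat this as an 
illustration only:

    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    // Hypothetical sketch of how the template stream is wired up.
    // ASCII_CharStream and Parser are the JavaCC-generated classes;
    // exact constructor signatures are assumed.
    public class WiringSketch
    {
        public static void main(String[] args) throws Exception
        {
            FileInputStream fis = new FileInputStream("encodingtest.vm");
            // input.encoding from velocity.properties selects the charset here;
            // the reader already produces correct Unicode characters
            InputStreamReader reader = new InputStreamReader(fis, "UTF-8");
            // the JavaCC-generated stream; this is where the high byte gets lost
            ASCII_CharStream stream = new ASCII_CharStream(reader, 1, 1);
            Parser parser = new Parser(stream);  // parser pulls chars from the stream
        }
    }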

The InputStreamReader decodes the bytes from the FileInputStream into the correct 
characters, but ASCII_CharStream then masks off the high byte of every character it 
reads from the InputStreamReader!
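
The arithmetic shows why Chinese text turns into control characters: the generated 
readChar() ANDs each character with 0xff, so U+4E0A collapses to U+000A, i.e. "\n".  A 
tiny standalone demo (my own example, not Velocity code):

    public class MaskDemo
    {
        public static void main(String[] args)
        {
            char shang = '\u4e0a';                // a Chinese character from the template
            char masked = (char) (0xff & shang);  // what ASCII_CharStream.readChar() effectively does
            System.out.println((int) masked);     // prints 10, i.e. U+000A, a newline
        }
    }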

I modified encodingtest.vm in the testbed so that it includes Chinese characters 
(meaning "surfing the net").  The file is also encoded in UTF-8.  The corresponding 
encodingtest.cmp is included as well.

Test files and solutions are attached.
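
For reference, after the changes described in my earlier mail (quoted below), the 
readChar() in the attached UCode_CharStream.java looks roughly like this.  Only the 
two lines marked "was:" differ from the generated code; the rest of the method body 
(inBuf, FillBuff, UpdateLineColumn) is my recollection of what JavaCC generates, so 
take it as an approximation:

    public final char readChar() throws java.io.IOException
    {
        if (inBuf > 0)
        {
            --inBuf;
            // was: return (char)((char)0xff & buffer[(bufpos == bufsize - 1) ? (bufpos = 0) : ++bufpos]);
            return buffer[(bufpos == bufsize - 1) ? (bufpos = 0) : ++bufpos];
        }

        if (++bufpos >= maxNextCharInd)
            FillBuff();

        // was: char c = (char)((char)0xff & buffer[bufpos]);
        char c = buffer[bufpos];

        UpdateLineColumn(c);
        return c;
    }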

Best regards,
Michael Zhou

---- You wrote at 2001-05-21 23:05:00 ----
From: Geir Magnusson Jr. <[EMAIL PROTECTED]>
To: velocity-dev <[EMAIL PROTECTED]>
Subject: Re: Velocity v1.1-rc1 released

The testbed tests Chinese characters, using UTF-8.  Does that not work?

I thought UTF-8 had bigger than 8 bit characters.... ?

geir


Michael Zhou wrote:
> 
> Hi guys,
> 
> I'm pleased to see Velocity 1.1-rc1 released.  It lets you specify the encoding per
> template, which is very useful for international applications.  Unfortunately,
> there's still a bug (I think; maybe it's not Velocity's bug, but JavaCC's).  How
> can Velocity process international characters such as Chinese?  I changed the
> input.encoding property to GBK in the velocity.properties file, to make it recognize
> Chinese.  It seemed to work well, until it saw the Chinese character (U+4e0a).
> Velocity complained that it encountered a "\n" after a double quote.  Yes, it's
> Velocity (JavaCC) masking the higher byte, so it treated (U+4e0a) as (U+000a), which
> is the same as "\n".
> 
> I tried to correct this by adding the line "UNICODE_INPUT=true" to the
> "org/apache/velocity/runtime/parser/Parser.jjt" file and rebuilding Velocity, so
> that JavaCC generates "UCode_CharStream.java" instead of "ASCII_CharStream.java".
> But the result was even stranger!  The generated stream eats every two characters
> and combines them into one character, regardless of whether you initialize the
> parser with a byte-based stream or a character-based reader!  Finally, I found a
> solution.
> 
> * First, add the option "UNICODE_INPUT=true" to Parser.jjt and run the "build"
> script in the "/org/apache/velocity/runtime/parser" directory.  It will generate a
> file named UCode_CharStream.java.
> 
> * Replace the generated UCode_CharStream.java with the original
> ASCII_CharStream.java using the shell command: mv ASCII_CharStream.java UCode_CharStream.java
> 
> * Edit UCode_CharStream.java (e.g. with vi UCode_CharStream.java):
>    1.  Replace the class name and constructor names with UCode_CharStream.
>    2.  Change the lines below in the readChar() method so that it handles Unicode
>        correctly.
>        change:  return (char)((char)0xff & buffer[(bufpos == bufsize - 1) ? (bufpos = 0) : ++bufpos]);
>            to:  return buffer[(bufpos == bufsize - 1) ? (bufpos = 0) : ++bufpos];
>        change:  char c = (char)((char)0xff & buffer[bufpos]);
>            to:  char c = buffer[bufpos];
> 
> * Using "USER_CHAR_STREAM=true" in JavaCC also works.
> 
> Now I think the product is perfect!
> 
> Michael Zhou

-- 
Geir Magnusson Jr.                           [EMAIL PROTECTED]
System and Software Consulting
Developing for the web?  See http://jakarta.apache.org/velocity/
"still climbing up to the shoulders..."

encodingtest.cmp

encodingtest.vm

UCode_CharStream.java

Parser.jjt
