Hi Geir Magnusson Jr.,
(I sent this mail earlier, but it does not seem to have gone through. Sorry if you
get it twice.)
Yes, the test passes, but that doesn't mean there is no bug. The bug has nothing to do
with UTF-8 or any other encoding; decoding is java.io.InputStreamReader's job. The bug
is in JavaCC's ASCII_CharStream, which masks off the high byte of every Unicode
character. In detail:
* Velocity opens a file stream on its template file and wraps it in an
InputStreamReader using the configured encoding (specified in velocity.properties:
input.encoding=XXX).
* It then creates an ASCII_CharStream with the reader as a parameter.
* It creates a JavaCC Parser with the ASCII_CharStream as a parameter.
* The Parser parses the characters it reads from the ASCII_CharStream, which in turn
reads from the InputStreamReader and the FileInputStream.
The InputStreamReader gets the correct characters from the FileInputStream, but
ASCII_CharStream masks off the high byte of every character it reads from the
InputStreamReader!
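
To make the effect concrete, here is a minimal standalone sketch (not Velocity or JavaCC code; the class and method names are made up for illustration). It decodes the UTF-8 bytes of U+4E0A through an InputStreamReader and then applies the same 0xff mask expression that the generated readChar() uses:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only mimics the masking expression from ASCII_CharStream.
public class MaskDemo {

    // Same expression shape as in the generated readChar(): keeps only the low 8 bits.
    static char masked(char c) {
        return (char) ((char) 0xff & c);
    }

    // Decode the first character of a UTF-8 byte sequence via InputStreamReader,
    // the same kind of reader Velocity wraps around the template file stream.
    static char decodeFirstChar(byte[] utf8) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            return (char) reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xE4, (byte) 0xB8, (byte) 0x8A}; // UTF-8 bytes of U+4E0A
        char decoded = decodeFirstChar(utf8);
        char afterMask = masked(decoded);
        System.out.printf("decoded = U+%04X%n", (int) decoded);   // prints "decoded = U+4E0A"
        System.out.printf("masked  = U+%04X%n", (int) afterMask); // prints "masked  = U+000A"
        System.out.println("is newline: " + (afterMask == '\n')); // prints "is newline: true"
    }
}
```

The reader delivers the character intact; only the mask turns it into a line feed, which matches the "\n after double-quote" complaint from the parser.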
I modified encodingtest.vm in the testbed so that it includes Chinese characters
(meaning "surfing the net"). The file is also encoded in UTF-8. The corresponding
encodingtest.cmp is included as well.
Test files and solutions are attached.
Best regards,
Michael Zhou
---- You wrote at 2001-05-21 23:05:00 ----
From: Geir Magnusson Jr. <[EMAIL PROTECTED]>
To: velocity-dev <[EMAIL PROTECTED]>
Subject: Re: Velocity v1.1-rc1 released
The testbed tests Chinese characters, using UTF-8. Does that not work?
I thought UTF-8 had bigger than 8 bit characters.... ?
geir
Michael Zhou wrote:
>
> Hi guys,
>
> I'm pleased to see Velocity 1.1-rc1 released. It lets you specify the template
>encoding, which is very useful for international applications. Unfortunately,
>there is still a bug (I think; maybe it's not Velocity's bug but JavaCC's). How
>can Velocity process international characters such as Chinese? I set the
>input.encoding=GBK property in the velocity.properties file so that it would recognize
>Chinese. It seemed to work well until it hit a Chinese character (U+4e0a). Velocity
>complained that it had encountered a "\n" after a double quote. Velocity (JavaCC) masks
>off the high byte, so it treated U+4e0a as U+000a, which is "\n".
>
> I tried to fix this by adding the line "UNICODE_INPUT=true" to the
>"org/apache/velocity/runtime/parser/Parser.jjt" file and rebuilding Velocity, so
>that JavaCC generates "UCode_CharStream.java" instead of "ASCII_CharStream.java".
>But the result was even stranger, because JavaCC consumes two characters at a time
>and combines them into one character, regardless of whether you initialize the parser
>with a byte-based stream or a character-based reader! Finally, I found the solution.
>
> * First, add the option "UNICODE_INPUT=true" to Parser.jjt and run the "build"
>shell script in the "/org/apache/velocity/runtime/parser" directory. This generates
>the file UCode_CharStream.java.
>
> * Replace the generated UCode_CharStream.java with the original ASCII_CharStream.java
>using the shell command: mv ASCII_CharStream.java UCode_CharStream.java
>
> * Edit UCode_CharStream.java (vi UCode_CharStream.java):
> 1. Rename the class and its constructors to UCode_CharStream.
> 2. Change the following lines in the readChar() method so that it handles Unicode
>correctly.
> change: return (char)((char)0xff & buffer[(bufpos == bufsize - 1) ? (bufpos = 0) : ++bufpos]);
> to: return buffer[(bufpos == bufsize - 1) ? (bufpos = 0) : ++bufpos];
> change: char c = (char)((char)0xff & buffer[bufpos]);
> to: char c = buffer[bufpos];
>
> * Setting "USER_CHAR_STREAM=true" in JavaCC also works.
>
> Now I think the product is perfect!
>
> Michael Zhou
--
Geir Magnusson Jr. [EMAIL PROTECTED]
System and Software Consulting
Developing for the web? See http://jakarta.apache.org/velocity/
"still climbing up to the shoulders..."
encodingtest.cmp
encodingtest.vm
UCode_CharStream.java
Parser.jjt