On Tue, 2006-02-14 at 23:51 +0100, Fred Vos wrote:
> Hello,
> 
> Last September I started a course in the Arabic language. Now I want to present
> Arabic texts on a website, using Cocoon. Normal Arabic text is supported in
> both Mozilla and Konqueror under Linux without a problem. Using entity-encoded
> Unicode strings like بحر, browsers will present the Arabic
> characters Beh, Hah and Reh from right to left. No problem.
> 
> But for beginners the Arabic language supports vowel signs like the Fatha, Damma or
> Kasra, making it easier to understand how one must pronounce the texts. You
> can find these signs in the Unicode table as combining characters. To present
> the above word as BaHRoenn (=sea), you can add the combining characters Fatha,
> Sukun and Dammatan: بَحْرٌ
> 
> Try this under Mozilla or Konqueror and a strange thing happens: it is
> presented from left to right and becomes unreadable, even for an Arab. I don't
> know whether IE gets this right.
> 
> The only renderer that seems to work here is Batik. If I enter the above text
> in an SVG file and convert it into a PNG file with the Batik rasterizer
> (command line interface), it is presented correctly, from right to left and
> with the combining characters.
> 
> Now my plan is as follows. I enter my texts including the combining characters
> in an XML file and transform these texts by removing the forbidden
> characters. I use the following XSL/XPath construct to remove the combining
> characters:
> 
> <xsl:for-each select="str:tokenize(string(@ar),
> '&#x064c;&#x064e;&#x064f;&#x0650;&#x0651;&#x0652;')">
>   <xsl:value-of select="." />
> </xsl:for-each>
> 
> (where @ar contains the string to convert)
> 
> This gives me Arabic text without the vowels. Any browser will present this text
> correctly, from right to left. To present the text with vowels I
> want to convert the texts using a dynamically generated SVG file and the
> svg2png serializer.
> 
> For western texts, things are easy. Using a basic SVG file for the generator,
> I can transform this document with an XSL transformer, using the wildcard in
> the matcher as a parameter to the transformer. The transformer adds the
> parameter as text. This creates the SVG document including the text. Using the
> svg2png serializer, I can get a PNG document containing my dynamic text.
> 
> Unfortunately this doesn't work for Arabic text, even without the combining
> characters.
> 
> Here's the matcher in the sitemap:
> 
>       <map:match pattern="arab/artrans-*">
>         <map:generate type="file" src="style/artrans.svg"/>
>         <map:transform type="xslt" src="style/artranssvg.xsl">
>           <map:parameter name="text" value="{1}"/>
>         </map:transform>
>         <map:serialize type="svg2png"/>
>       </map:match>
> 
> If I try to use http://host:port/.../arab/artrans-<Arabic text for BaHRoenn
> without vowels pasted here> in my browser (Mozilla), the URL is converted
> into http://host:port/.../arab/artrans-%D8%A8%D8%AD%D8%B1 and the picture
> contains rubbish text.
> 
> Does anyone here have any idea how I can successfully use the Batik rasterizer
> in the Cocoon environment for dynamically generating PNG or JPEG pictures with
> Arabic texts?

salamu habibi,
(the only Arabic I know)

If I understand correctly, the problem here is not that Batik works
differently when used inside Cocoon, but that the characters in the URL
are not decoded correctly.

I have had the same experience. While request parameters and POST bodies
are decoded correctly, the URL path itself is not.
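
To make the symptom concrete, here is a minimal sketch (plain Java, no
Cocoon or servlet container involved) that decodes the path segment from
the URL quoted above in two ways. The class and method names are made up
for the illustration, and it assumes Java 10 or later for the
URLDecoder overload that takes a Charset. It is only meant to show why
the text turns into rubbish, not how any particular container decodes
paths internally.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Cocoon, Jetty or Tomcat.
public class PathDecodingDemo {
    // Percent-decode the segment and read the bytes as UTF-8.
    public static String decodeAsUtf8(String segment) {
        return URLDecoder.decode(segment, StandardCharsets.UTF_8);
    }

    // Percent-decode the same segment but read the bytes as ISO-8859-1.
    public static String decodeAsLatin1(String segment) {
        return URLDecoder.decode(segment, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Path segment as the browser sent it (taken from the URL above).
        String segment = "%D8%A8%D8%AD%D8%B1";

        String good = decodeAsUtf8(segment);
        String bad = decodeAsLatin1(segment);

        // Each Arabic letter is two bytes in UTF-8, so the wrong charset
        // yields twice as many (meaningless) characters.
        System.out.println("UTF-8 decoding:      " + good.length() + " characters");
        System.out.println("ISO-8859-1 decoding: " + bad.length() + " characters");
        System.out.println("UTF-8 result is Beh, Hah, Reh: "
                + good.equals("\u0628\u062D\u0631"));
    }
}
```

Read as UTF-8, the six bytes give the three letters Beh, Hah and Reh;
read as ISO-8859-1, they give six unrelated Latin-1 characters, which is
the kind of rubbish that ends up in the picture.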

This can be fixed though.

If you are running Jetty, supply the following parameter to the java
command line:
-Dorg.mortbay.util.URI.charset=UTF-8

If you are running Tomcat, you can do the same by editing
conf/server.xml, and on the Connector element (for http), add the
attribute URIEncoding="UTF-8".

With this setting, URL paths are decoded correctly as UTF-8.
However, request parameters will now also be decoded as UTF-8, while
Cocoon normally assumes that the servlet container decodes them as
ISO-8859-1 and then corrects this itself.
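
The kind of correction meant here can be sketched as follows. This is
not Cocoon's actual code, just an illustration of the general technique
under the assumption that the container decoded UTF-8 bytes as
ISO-8859-1; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class illustrating the re-decoding technique.
public class ParameterRedecodeDemo {
    // Turn the wrongly decoded string back into its original bytes,
    // then decode those bytes as UTF-8.
    public static String redecode(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u0628\u062D\u0631"; // Beh, Hah, Reh

        // Simulate a container that reads the UTF-8 bytes as ISO-8859-1.
        String containerView = new String(
                original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        // The correction recovers the original text.
        System.out.println("repaired: " + redecode(containerView).equals(original));

        // Applied to text that was already decoded correctly, the same
        // correction destroys it, because Arabic letters cannot be
        // represented in ISO-8859-1.
        System.out.println("double correction keeps text: "
                + redecode(original).equals(original));
    }
}
```

The second check is the reason the configuration has to be consistent:
once the container really decodes as UTF-8, a correction that still
assumes ISO-8859-1 would damage the parameters instead of fixing them.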

The solution I have is to add a servlet filter which will set the
character encoding to UTF-8. Here's the source for such a filter:

package my;

import javax.servlet.*;
import java.io.IOException;

public class CharacterEncodingFilter implements Filter {
    // Encoding name read from the "encoding" init-param in web.xml.
    private String encoding;

    public void init(FilterConfig filterConfig) throws ServletException {
        encoding = filterConfig.getInitParameter("encoding");
    }

    public void doFilter(ServletRequest servletRequest,
                         ServletResponse servletResponse,
                         FilterChain filterChain)
            throws IOException, ServletException {
        // Only set the encoding if the request did not declare one itself.
        if (servletRequest.getCharacterEncoding() == null && this.encoding != null) {
            servletRequest.setCharacterEncoding(this.encoding);
        }
        filterChain.doFilter(servletRequest, servletResponse);
    }

    public void destroy() {
    }
}

Compile this, put it in a jar, and put the jar in WEB-INF/lib. Edit the web.xml
file and add the following before the first <servlet> element:

  <filter>
    <filter-name>encoding-filter</filter-name>
    <filter-class>my.CharacterEncodingFilter</filter-class>
    <init-param>
      <param-name>encoding</param-name>
      <param-value>UTF-8</param-value>
    </init-param>
  </filter>

  <filter-mapping>
    <filter-name>encoding-filter</filter-name>
    <url-pattern>/*</url-pattern>
  </filter-mapping>

In the same web.xml file, adjust both the form-encoding and
container-encoding parameters to be UTF-8 (these elements are already
there; don't add new ones):

    <init-param>
      <param-name>container-encoding</param-name>
      <param-value>UTF-8</param-value>
    </init-param>

    <init-param>
      <param-name>form-encoding</param-name>
      <param-value>UTF-8</param-value>
    </init-param>

(The container-encoding is now UTF-8 since the filter has instructed the
container to decode everything as UTF-8, while by default it would use
ISO-8859-1. This is needed because otherwise we can't distinguish
between the UTF-8 decoded URL and the ISO-8859-1 decoded POST body.)

This should make everything work correctly.

BTW, I found all this out only very recently and will raise it on the
dev list, to make this the default in Cocoon.

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center