On Tue, 2006-02-14 at 23:51 +0100, Fred Vos wrote:
> Hello,
>
> Last September I started a course in Arabic language. Now I want to present
> Arabic texts on a website, using Cocoon. Normal Arabic text is supported in
> both Mozilla and Konqueror under Linux without a problem. Using entity encoded
> Unicode strings like بحر, browsers will present Arabic
> characters Beh, Hah and Reh from right to left. No problem.
>
> But for beginners the Arabic language supports vowel marks like the Fatha, Damma or
> Kasra, making it easier to understand how one must pronounce the texts. You
> can find these signs in the Unicode table as combining characters. To present
> the above word as BaHRoenn (=sea), you can add combining characters Fatha,
> Sukun and Dammatan: بَحْرٌ
>
> Try this under Mozilla or Konqueror and a strange thing happens: it is
> presented from left-to-right and gets unreadable, even for an Arab. Don't know
> if IE does this right.
>
> The only renderer that seems to work here is Batik. If I enter the above text
> in an SVG file and convert it into a PNG file with the Batik rasterizer
> (command line interface), it is presented correctly, from right to left and
> with the combining characters.
>
> Now my plan is as follows. I enter my texts including the combining characters
> in an XML file and transform these texts by removing the forbidden
> characters. I use the following XSL/XPath construct to remove the combining
> characters:
>
> <xsl:for-each select="str:tokenize(string(@ar),
> 'ٌَُِّْ')">
> <xsl:value-of select="." />
> </xsl:for-each>
>
> (where @ar contains the string to convert)
>
> This gives me Arabic text without the vowels. Any browser will present this text
> nicely from right to left. To present the text with vowels I
> want to convert the texts using a dynamically generated SVG file and the
> svg2png serializer.
>
> For western texts, things are easy. Using a basic SVG file for the generator,
> I can transform this document with an XSL transformer, using the wildcard in
> the matcher as a parameter to the transformer. The transformer adds the
> parameter as text. This creates the SVG document including the text. Using the
> svg2png serializer, I can get a PNG document containing my dynamic text.
>
> Unfortunately this doesn't work for Arabic text, even without the combining
> characters.
>
> Here's the matcher in the sitemap:
>
> <map:match pattern="arab/artrans-*">
> <map:generate type="file" src="style/artrans.svg"/>
> <map:transform type="xslt" src="style/artranssvg.xsl">
> <map:parameter name="text" value="{1}"/>
> </map:transform>
> <map:serialize type="svg2png"/>
> </map:match>
>
> If I try to use http://host:port/.../arab/artrans-<arab text for BaHRoenn
> without vowels is pasted here> in my browser (Mozilla), the URL is converted
> into http://host:port/.../arab/artrans-%D8%A8%D8%AD%D8%B1 and the picture
> contains rubbish text.
>
> Does anyone here have any idea how I can successfully use the Batik rasterizer
> in the Cocoon environment for dynamically generating PNG or JPEG pictures with
> Arabic texts?
salamu habibi,
(the only Arabic I know)
If I understand correctly, the problem here is not that Batik works
differently when used inside Cocoon, but that the characters in the URL
are not decoded correctly.
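To illustrate what I mean, here is a small standalone sketch (my own test class, not code taken from Cocoon or from any servlet container) that decodes the escaped path from your URL in both charsets. The class and method names are made up for the example, and java.net.URLDecoder is really meant for form data (it also turns "+" into a space), so take it as an approximation of what a container does with the path:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UrlPathDecodingDemo {
    // Percent-decodes the escaped text, interpreting the resulting bytes
    // in the given charset. Hypothetical helper for this example only.
    public static String decode(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String escaped = "%D8%A8%D8%AD%D8%B1";

        // Each escaped byte becomes its own Latin-1 character: rubbish.
        String asLatin1 = decode(escaped, "ISO-8859-1");
        // Each pair of bytes becomes one Arabic letter: Beh, Hah, Reh.
        String asUtf8 = decode(escaped, "UTF-8");

        System.out.println(asLatin1.length());                     // 6
        System.out.println(asUtf8.length());                       // 3
        System.out.println(asUtf8.equals("\u0628\u062D\u0631"));   // true
    }
}
```

The six-character Latin-1 result is what ends up in the text parameter of your transformer when the path is decoded with the wrong charset, which would explain the rubbish in the picture.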
I have had the same experience. While request parameters and POST bodies
are decoded correctly, the URL path itself is not.
This can be fixed, though.
If you are running Jetty, supply the following parameter to the java
command line:
-Dorg.mortbay.util.URI.charset=UTF-8
If you are running Tomcat, you can do the same by editing
conf/server.xml, and on the Connector element (for http), add the
attribute URIEncoding="UTF-8".
With this change, URL paths are correctly decoded as UTF-8.
However, this also means that request parameters will be decoded as
UTF-8, while Cocoon normally assumes the servlet container decodes them
as ISO-8859-1 and then corrects this itself.
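A rough sketch of why this matters (again my own illustration with made-up names, not Cocoon's actual code): the usual correction turns the wrongly decoded string back into its original bytes and reads those as UTF-8. That works on ISO-8859-1 decoded input, but it destroys a value the container has already decoded as UTF-8:

```java
import java.nio.charset.StandardCharsets;

public class ContainerEncodingDemo {
    // Assumes the container decoded UTF-8 bytes as ISO-8859-1: recover the
    // original bytes and decode them as UTF-8. Hypothetical helper.
    public static String correct(String decodedByContainer) {
        byte[] raw = decodedByContainer.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String arabic = "\u0628\u062D\u0631"; // Beh, Hah, Reh

        // What an ISO-8859-1 container hands over for UTF-8 input.
        String wronglyDecoded = new String(
                arabic.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

        // The correction restores the text in this case.
        System.out.println(correct(wronglyDecoded).equals(arabic)); // true

        // If the container already decoded as UTF-8, the Arabic letters
        // cannot be mapped back to ISO-8859-1 bytes and are replaced by '?'.
        System.out.println(correct(arabic)); // ???
    }
}
```

So once the container decodes as UTF-8, whatever does the correction has to be told about it, which is what the configuration below is for.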
The solution I have is to add a servlet filter which will set the
character encoding to UTF-8. Here's the source for such a filter:
package my;

import javax.servlet.*;
import java.io.IOException;

/** Sets a default request character encoding when the request declares none. */
public class CharacterEncodingFilter implements Filter {
    private String encoding;

    public void init(FilterConfig filterConfig) throws ServletException {
        // Read from the "encoding" init-param of the filter in web.xml.
        encoding = filterConfig.getInitParameter("encoding");
    }

    public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse,
            FilterChain filterChain) throws IOException, ServletException {
        // Only apply the default when the request has no encoding of its own.
        if (servletRequest.getCharacterEncoding() == null && this.encoding != null) {
            servletRequest.setCharacterEncoding(this.encoding);
        }
        filterChain.doFilter(servletRequest, servletResponse);
    }

    public void destroy() {
    }
}
Compile this, put it in a jar, and put the jar in WEB-INF/lib. Edit the web.xml
file and add the following before the opening <servlet> element:
<filter>
  <filter-name>encoding-filter</filter-name>
  <filter-class>my.CharacterEncodingFilter</filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>UTF-8</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>encoding-filter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
In the same web.xml file, adjust both the form-encoding and
container-encoding parameters to be UTF-8 (these elements are already
there, don't add new ones):
<init-param>
  <param-name>container-encoding</param-name>
  <param-value>UTF-8</param-value>
</init-param>
<init-param>
  <param-name>form-encoding</param-name>
  <param-value>UTF-8</param-value>
</init-param>
(The container-encoding is now UTF-8 since the filter has instructed the
container to decode everything as UTF-8, while by default it will use
ISO-8859-1. This is needed because we otherwise can't distinguish
between the UTF-8 decoded URL and the ISO-8859-1 decoded POST body.)
And this should make everything work correctly.
BTW, I have found out all this only very recently and will take up the
discussion on the dev list to make this the default in Cocoon.
--
Bruno Dumon http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center