Hi everyone. Remember that problem with non-Latin-1 characters being painfully distorted when added to a UTF-8 encoded string (in Perl 5.6.x)? Although slightly off-topic, we discussed it here in November last year under the subject "UTF8 support and issues". Then a magic workaround was proposed:
$var = pack( 'U*', unpack( 'U*', $var ) );

If applied to all template variables, it was supposed to make the template output clean. Now I have run into this problem again.

Let me start from the beginning. I use Perl 5.6.1, XML::XPath and Template-Toolkit (2.07) to read in an XML file (content) and write out a bunch of HTML files (presentation). All XML-parsed data in Perl is correctly UTF-8 encoded. That is perfectly fine with me: I write HTML in UTF-8. My XML file has non-Latin content (Russian). To write out correct UTF-8, I use the pack/unpack workaround mentioned above to make my data clean.

This worked well until recently. Then I introduced some Russian words into one of my templates, and they got f##ked up when output by Template-Toolkit. Before that there were only pure Latin characters in the templates and it worked fine. As you may guess, the templates are in UTF-8, although this doesn't make much difference: there is text in a template and it gets output in a distorted way.

To make my point clear:
1- this only happens if you output it together with some internally-marked UTF-8 data, e.g. XML-originating data;
2- it is a Perl bug, not TT's.

To make my point even clearer, I attach a demo script that reproduces the bug, together with one small file it needs. (In addition to TT, it also needs XML::XPath to run.) It tries to output the same non-Latin UTF-8 string, read first from XML and then from a template file. It tries this in several different ways, showing you the result of each.

So I had no other option but to dig into the TT sources a little to add that funny pack/unpack hack where it might help. After a number of attempts, I stuck it into certain points of Template::Provider and Template::Directive. That's exactly two lines changed and two lines added. Now my XML-to-HTML thing works fine, as long as I put the changed versions of these modules in PERLLIB before the original ones.
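
For anyone who wants to see the mixing effect from point 1 in isolation, here is a minimal sketch. It does not use TT or XML::XPath and is not the attached demo script. It assumes Perl 5.8.1 or later, where the built-in utf8::decode and utf8::encode functions are available; I have not checked it against 5.6.1. The variable names are made up for illustration, and decoding the template text before concatenation stands in for the pack/unpack hack rather than reproducing it.

```perl
use strict;
use warnings;

# A Russian word as raw UTF-8 bytes (12 bytes, no internal UTF-8 flag).
# Hypothetical stand-in for template text read from disk without decoding.
my $template_bytes = "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# The same word as 6 characters with the internal UTF-8 flag set.
# Hypothetical stand-in for data returned by the XML parser.
my $xml_chars = "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}";

# Mixing the two: Perl upgrades the byte string as if each byte were a
# Latin-1 character, so every template byte is encoded a second time
# when the result is written out as UTF-8.
my $garbled = $template_bytes . $xml_chars;
utf8::encode($garbled);            # simulate writing the result as UTF-8
print length($garbled), "\n";      # prints 36 (24 + 12), not the expected 24

# Decoding the template text into characters first avoids the problem.
my $template_chars = $template_bytes;
utf8::decode($template_chars)
    or die "template text is not valid UTF-8";
my $clean = $template_chars . $xml_chars;
utf8::encode($clean);
print length($clean), "\n";        # prints 24: the word twice, 12 bytes each
```

The point of the sketch is only that both strings must be in the same internal form before they are joined; which of the two gets converted, and where in TT that should happen, is a separate question.
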
So, the question: should this fix be included in the official TT, or should it stay as it is, a hack? Or maybe we can and should make it optional? I don't know. But I certainly think there should be an official TT way to work around this Perl problem, because it is serious. It's not specific to Russian: it will happen every time there is some XML data and some non-Latin data.

I did the simplest possible timing tests. At least on my system, with my program (which does quite a bit of template processing, but not only that), the change doesn't show any significant performance hit.

But if we do make it official, then we probably need to do the same thing to every variable included in TT output as well (I mean the original problem discussed in November). So this probably needs to be a more general solution.

What do you think? BTW, is this problem fixed in Perl 5.8?

Cheers,
Kurmanov
utf8bugdemo.pl
Description: Perl program
привет
