[Templates] Patch: Full unicode support for TT under 5.8

Mark Fowler Sun, 27 Jun 2004 06:51:23 -0700

Hello List.

Yes, I've not been reading this list for a long time now.  Bad me.
Doesn't mean I've not been working at TT though...


Attached is a patch (and a test which tests the patch) that allows
the Template Toolkit to work properly with Unicode in perl 5.8.  This
means that:

 a) As long as you put a BOM at the start of the document, you can now
    have proper Unicode templates.  Using the correct BOM means that your
    templates can be encoded in any of UTF-8, UTF-16 (either byte order)
    or UTF-32 (either byte order) and TT will now Do The Right Thing when
    it comes to automatically decoding your input.

 b) As a consequence of this your Unicode templates are now truly
    Unicode, not just a string of bytes.  Things like detecting the length
    of the string will work.  Concatenating the result of the template
    with a Perl string that contains chars higher than 255 will now
    work properly (previously, as Perl did not know that the template
    already contained the byte sequences for Unicode, it would promote
    each byte in the template to its corresponding Latin-1 char rather
    than amalgamating each sequence into one char as it should have).

 c) No matter how you got Unicode into your compiled template (be it from
    the original template file or from a utf8 flagged constant from
    NAMESPACES) cached templates that contain chars over 255 will be
    written to the disk in utf8 and the 'utf8' pragma will be prepended.
    This stops the annoying problem that the first apache process would
    load and continue to process utf8 data correctly but all other
    processes would load the cached template from disk and incorrectly
    assume that the bytes in it were Latin-1 chars rather than utf8 byte
    sequences and render the output badly.
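
To make point (b) concrete, here is a minimal standalone sketch of the
difference between undecoded bytes and a decoded string.  It does not use
TT at all; the sample text and variable names are made up for
illustration, and only core Encode calls are assumed.

```perl
use strict;
use warnings;
use Encode ();

# Raw bytes as they might be read from a template file:
# a UTF-8 BOM followed by the UTF-8 encoding of "caf" + e-acute.
my $raw     = "\xef\xbb\xbf" . "caf\xc3\xa9";
my $payload = substr($raw, 3);          # the bytes after the BOM

# Decode into Perl's internal character representation.
# LEAVE_SRC keeps $payload intact so it can be reused below.
my $text = Encode::decode('UTF-8', $payload,
                          Encode::FB_CROAK | Encode::LEAVE_SRC);

# A Perl string containing a char higher than 255.
my $snowman = "\x{2603}";

# Undecoded bytes get promoted one by one to Latin-1 chars on
# concatenation; the decoded string behaves as real characters.
my $bad  = $payload . $snowman;
my $good = $text    . $snowman;

print length($raw),  "\n";   # 8 (3 BOM bytes + 5 payload bytes)
print length($text), "\n";   # 4 characters
print length($bad),  "\n";   # 6 (e-acute has become two chars)
print length($good), "\n";   # 5
```

The undecoded case is what used to happen: the two bytes of the accented
letter end up as two separate Latin-1 characters in the output.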

Can someone please check this over for me?  It's now early Sunday
afternoon and I was playing around with this in the small hours of the
morning (pesky jet-lag) so there may be a bucket-load of errors in here.

In particular, could someone still running Perl 5.6 and Perl 5.005 in
anger check that this patch doesn't cause problems for them?  And if
anyone's got any nice juicy real world benchmarks, I'd love to see how
much in practice this slows down someone using just Latin-1 templates (it
shouldn't slow things down much; it just has to check that there's no BOM
on the front).
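
For anyone who wants a rough number before trying real templates, the
following is a synthetic micro-benchmark sketch of the no-BOM path only.
It is not a substitute for a real world benchmark: has_bom() is a
hypothetical stand-in that mirrors the comparison loop in the patch, and
the Latin-1 sample template is invented.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Same BOMs, in the same order, as in the patch (32-bit before 16-bit).
my @boms = (
    "\xef\xbb\xbf",          # UTF-8
    "\x00\x00\xfe\xff",      # UTF-32BE
    "\xff\xfe\x00\x00",      # UTF-32LE
    "\xfe\xff",              # UTF-16BE
    "\xff\xfe",              # UTF-16LE
);

# Hypothetical helper: true if the string starts with any known BOM.
sub has_bom {
    my ($string) = @_;
    for my $bom (@boms) {
        return 1 if $bom eq substr($string, 0, length($bom));
    }
    return 0;
}

# An invented Latin-1 template with no BOM on the front.
my $latin1 = "Hello [% name %]\n" x 1000;

# A negative count asks Benchmark to run each case for at least
# that many CPU seconds.
timethese(-1, {
    bom_check => sub { has_bom($latin1) },
    baseline  => sub { length($latin1) },
});
```

The check only ever looks at the first four bytes, so its cost should not
grow with the size of the template.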

Mark.

-- 
#!/usr/bin/perl -T
use strict;
use warnings;
print q{Mark Fowler, [EMAIL PROTECTED], http://twoshortplanks.com/};

Index: lib/Template/Provider.pm
===================================================================
RCS file: /template-toolkit/Template2/lib/Template/Provider.pm,v
retrieving revision 2.80
diff -u -r2.80 Provider.pm
--- lib/Template/Provider.pm    2004/01/30 19:32:28     2.80
+++ lib/Template/Provider.pm    2004/06/27 13:10:53
@@ -628,6 +628,7 @@
         elsif (ref $name) {
             # ...or a GLOB or file handle...
             my $text = <$name>;
+            $text = $self->_decode($text) if $] > 5.007;
             $data = {
                 name => defined $alias ? $alias : 'input file handle',
                 text => $text,
@@ -638,6 +639,7 @@
         elsif (-f $name) {
             if (open(FH, $name)) {
                 my $text = <FH>;
+                $text = $self->_decode($text) if $] > 5.007;
                 $data = {
                     name => $alias,
                     path => $name,
@@ -966,6 +968,57 @@
        }
     }
 }
+
+#------------------------------------------------------------------------
+# _decode
+#
+# Decodes encoded unicode text that starts with a BOM and
+# turns it into perl's internal representation
+#------------------------------------------------------------------------
+
+my $boms = [
+ 'UTF-8'    => "\x{ef}\x{bb}\x{bf}",
+ 'UTF-32BE' => "\x{0}\x{0}\x{fe}\x{ff}",
+ 'UTF-32LE' => "\x{ff}\x{fe}\x{0}\x{0}",
+ 'UTF-16BE' => "\x{fe}\x{ff}",
+ 'UTF-16LE' => "\x{ff}\x{fe}",
+];
+
+# hack so that 'use bytes' will compile on perls earlier than 5.6
+# even though _decode is never called on those systems
+BEGIN { if ($] < 5.006) { package bytes; $INC{'bytes.pm'} = 1; } }
+
+sub _decode
+{
+  use bytes;
+
+  my $self   = shift;
+  my $string = shift;
+
+  # try all the BOMs in order looking for one (order is important
+  # 32bit BOMs look like 16bit BOMs)
+  my $count = 0;
+  while ($count < @{ $boms })
+  {
+    my $enc = $boms->[$count];
+    my $bom = $boms->[$count+1];
+
+    # does the string start with the bom?
+    if ($bom eq substr($string, 0, length($bom)))
+    {
+      # decode it and hand it back
+      require Encode;
+      return Encode::decode($enc, substr($string, length($bom)), 1);
+    }
+
+    $count += 2;
+  }
+
+  # no boms matched, must be a non unicode string
+  # just return it as it is
+  return $string;
+}
+
 
 1;
 
Index: lib/Template/Document.pm
===================================================================
RCS file: /template-toolkit/Template2/lib/Template/Document.pm,v
retrieving revision 2.72
diff -u -r2.72 Document.pm
--- lib/Template/Document.pm    2004/01/30 19:32:25     2.72
+++ lib/Template/Document.pm    2004/06/27 13:10:54
@@ -280,7 +280,12 @@
         ($fh, $tmpfile) = File::Temp::tempfile( 
             DIR => File::Basename::dirname($file) 
         );
-       print $fh $class->as_perl($content) || die $!;
+       my $perlcode = $class->as_perl($content) || die $!;
+        if ($] > 5.007 && utf8::is_utf8($perlcode)) {
+          $perlcode = "use utf8;\n\n$perlcode";
+          binmode $fh, ":utf8";
+        }
+       print $fh $perlcode;
        close($fh);
     };
     return $class->error($@) if $@;

unicode.t
Description: Troff document
