Re: [Agavi-Tickets] [Agavi] #478: Character escape sequences in CLDR data are not evaluated, single quotes possibly broken, too

Agavi Wed, 07 Mar 2007 20:51:31 -0800

#478: Character escape sequences in CLDR data are not evaluated, single quotes
possibly broken, too
-------------------------+--------------------------------------------------
 Reporter:  david        |        Owner:  dominik  
     Type:  defect       |       Status:  new      
 Priority:  high         |    Milestone:  0.11     
Component:  translation  |      Version:  0.11.0RC4
 Severity:  critical     |   Resolution:           
 Keywords:               |  
-------------------------+--------------------------------------------------
Old description:


> Example:
> source:/tags/0.11.0RC4/src/translation/data/locales/[EMAIL PROTECTED]
>
> Note the {{{\u00a0}}} - that's a non-breaking space
> (http://www.unicode.org/charts/PDF/U0080.pdf), it should be converted.
>
> Reproduce:
>  * configure and set locale to {{{ru_RU}}}
>  * format a date using {{{long}}} (here {{{date}}} only, for Nov 26,
> 2006).
>  * Expected result is: 26 ноября 2006 г.
>   * Yes, including the trailing dot
>   * The space between "2006" and "г" is a non-breaking space, Unicode
> character U+00A0 (decimal 160; not the single-byte "extended ASCII"
> character 160!)
>  * Actual result is: 26 ноября 2006\200600AM0г.'
>   * the trailing single quote indicates that single quotes are probably
> handled incorrectly, see http://unicode.org/reports/tr35/#Unicode_Sets
> (subsection E.1)
>   * the "2006" is from the lowercase "u" in the escape sequence
>   * the "AM" is from the lowercase "a" in the escape sequence
>
> Possible solution: grab the escape sequences (maybe best at compile
> time!) and convert them to XML entities, then convert them to UTF-8:
> {{{
> $seq = '00a0';
> html_entity_decode('&#x' . $seq . ';', ENT_QUOTES, 'utf-8');
> }}}
>
> The possible sequences are described in
> http://unicode.org/reports/tr35/#Unicode_Sets section E.2 (I don't think
> we can support {{{\N{name}}}} though
>
> And to make things even nicer:
>     Any character formed as the result of a backslash escape loses any
> special meaning and is treated as a literal. In particular, note that \u
> and \U escapes create literal characters. (In contrast, Java treats
> Unicode escapes as just a way to represent arbitrary characters in an
> ASCII source file, and any resulting characters are not tagged as
> literals.)
>
> My guess is that many other locales use escape sequences, not only in
> date (or other) patterns, so it's pretty important to fix this.
>
> I suggest replacing all {{{$node->getValue()}}} calls with
> {{{$this->_($node)}}} or something in AgaviLdmlConfigHandler, where
> {{{_()}}} calls {{{getValue()}}} on the node and then looks for escape
> sequences to eval.
>
> Note: Prado and Symfony use ICU's {{{.dat}}} files and thus are not
> affected. Zend Framework doesn't seem to handle the escape sequences at
> all.

New description:

 Example:
 source:/tags/0.11.0RC4/src/translation/data/locales/[EMAIL PROTECTED]

 Note the {{{\u00a0}}} - that's a non-breaking space
 (http://www.unicode.org/charts/PDF/U0080.pdf), it should be converted.

 Reproduce:
  * configure and set locale to {{{ru_RU}}}
  * format a date using {{{long}}} (here {{{date}}} only, for Nov 26,
 2006).
  * Expected result is: 26 ноября 2006 г.
   * Yes, including the trailing dot
   * The space between "2006" and "г" is a non-breaking space, Unicode
 character U+00A0 (decimal 160; not the single-byte "extended ASCII"
 character 160!)
  * Actual result is: 26 ноября 2006\200600AM0г.'
   * the trailing single quote indicates that single quotes are probably
 handled incorrectly, see http://unicode.org/reports/tr35/#Unicode_Sets
 (subsection E.1)
   * the "2006" is from the lowercase "u" in the escape sequence
   * the "AM" is from the lowercase "a" in the escape sequence

 Possible solution: grab the escape sequences (maybe best at compile time!)
 and convert them to XML entities, then convert them to UTF-8:
 {{{
 // hex digits taken from the \u00a0 escape
 $seq = '00a0';
 // $char now holds U+00A0 encoded as UTF-8
 $char = html_entity_decode('&#x' . $seq . ';', ENT_QUOTES, 'UTF-8');
 }}}

 The possible sequences are described in
 http://unicode.org/reports/tr35/#Unicode_Sets section E.2 (I don't think
 we can support {{{\N{name}}}} though)
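
 A minimal sketch of how the conversion above could be applied to a whole
 value rather than a single sequence. The function name
 decodeUnicodeEscapes() is made up for illustration and is not part of
 Agavi. Only the \uXXXX and \UXXXXXXXX forms are handled; the remaining
 forms from section E.2, including \N{name}, are left untouched. The
 closure needs PHP 5.3 or later, and html_entity_decode() may decline to
 decode some control code points, so this is a starting point rather than
 a complete solution:

```php
<?php
// Hypothetical helper, not part of Agavi: replaces \uXXXX and \UXXXXXXXX
// escapes in $value with the UTF-8 encoding of the code point.
function decodeUnicodeEscapes($value)
{
    return preg_replace_callback(
        // group 1: four hex digits after \u, group 2: eight hex digits after \U
        '/\\\\u([0-9A-Fa-f]{4})|\\\\U([0-9A-Fa-f]{8})/',
        function ($m) {
            // group 1 is an empty string when the \U alternative matched
            $hex = $m[1] !== '' ? $m[1] : $m[2];
            // same trick as in the ticket: numeric entity, decoded to UTF-8
            return html_entity_decode('&#x' . $hex . ';', ENT_QUOTES, 'UTF-8');
        },
        $value
    );
}

// Single-quoted PHP strings keep the backslash, just like the raw CLDR data.
$decoded = decodeUnicodeEscapes('2006\u00a0г.');
```

 Here $decoded should contain "2006", the two bytes 0xC2 0xA0 (U+00A0 in
 UTF-8), and then "г.".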

 And to make things even nicer:
     Any character formed as the result of a backslash escape loses any
 special meaning and is treated as a literal. In particular, note that \u
 and \U escapes create literal characters. (In contrast, Java treats
 Unicode escapes as just a way to represent arbitrary characters in an
 ASCII source file, and any resulting characters are not tagged as
 literals.)
 That's not too much of an issue though, since special meanings should only
 occur in format strings. Therefore, we can evaluate the escape sequences,
 put the resulting characters in single quotes, and let the formatters
 remove these single quotes again. That should be safe.
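
 A rough sketch of the quoting idea, under a few assumptions.
 quoteEscapesAsLiterals() is a made-up name, and the pattern used below is
 only an illustrative stand-in for the real ru_RU data. One catch: wrapping
 a decoded character in quotes directly in front of an already quoted
 section would produce two adjacent quotes, which reads as an escaped
 apostrophe. The sketch therefore splits the pattern into literal and
 non-literal runs first and merges neighbouring literal runs before
 quoting them again:

```php
<?php
// Hypothetical helper, not part of Agavi: evaluates \uXXXX and \UXXXXXXXX
// escapes in a format pattern and emits the result as quoted literal text.
function quoteEscapesAsLiterals($pattern)
{
    $segments = array(); // list of array(isLiteral, text)
    $append = function ($isLiteral, $text) use (&$segments) {
        $last = count($segments) - 1;
        if ($last >= 0 && $segments[$last][0] === $isLiteral) {
            $segments[$last][1] .= $text; // merge runs of the same kind
        } else {
            $segments[] = array($isLiteral, $text);
        }
    };

    $inQuote = false;
    $length = strlen($pattern);
    for ($i = 0; $i < $length;) {
        if ($pattern[$i] === "'") {
            if ($i + 1 < $length && $pattern[$i + 1] === "'") {
                $append(true, "'"); // two quotes stand for one literal apostrophe
                $i += 2;
            } else {
                $inQuote = !$inQuote; // opening or closing quote
                $i++;
            }
        } elseif (preg_match('/\G\\\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))/', $pattern, $m, 0, $i)) {
            $hex = $m[1] !== '' ? $m[1] : $m[2];
            // decoded characters are always literals
            $append(true, html_entity_decode('&#x' . $hex . ';', ENT_QUOTES, 'UTF-8'));
            $i += strlen($m[0]);
        } else {
            // byte-wise copy is fine: only ASCII bytes are ever inspected
            $append($inQuote, $pattern[$i]);
            $i++;
        }
    }

    $out = '';
    foreach ($segments as $segment) {
        list($isLiteral, $text) = $segment;
        $out .= $isLiteral ? "'" . str_replace("'", "''", $text) . "'" : $text;
    }
    return $out;
}

// Illustrative pattern only; not copied from the CLDR file.
$quoted = quoteEscapesAsLiterals('d MMMM y\u00a0' . "'г'.");
```

 With that input the non-breaking space and the "г" end up in one quoted
 run, so a formatter that strips quotes sees them as plain text.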

 My guess is that many other locales use escape sequences, not only in date
 (or other) patterns, so it's pretty important to fix this.

 I suggest replacing all {{{$node->getValue()}}} calls in
 AgaviLdmlConfigHandler with something like {{{$this->_($node)}}}, where
 {{{_()}}} calls {{{getValue()}}} on the node and then evaluates any
 escape sequences it finds.
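
 To make the suggestion concrete, here is roughly what such a wrapper
 could look like. SketchNode and SketchLdmlHandler are stand-ins invented
 for this example; only getValue() and the name _() come from the text
 above, and the real AgaviLdmlConfigHandler is not reproduced here. For
 brevity the method only evaluates \uXXXX:

```php
<?php
// Stand-in for the real config value node; only getValue() is assumed.
class SketchNode
{
    private $value;

    public function __construct($value)
    {
        $this->value = $value;
    }

    public function getValue()
    {
        return $this->value;
    }
}

// Hypothetical handler showing the single choke point for reading values.
class SketchLdmlHandler
{
    // Reads the node value and evaluates \uXXXX escapes in it.
    public function _($node)
    {
        return preg_replace_callback(
            '/\\\\u([0-9A-Fa-f]{4})/',
            function ($m) {
                return html_entity_decode('&#x' . $m[1] . ';', ENT_QUOTES, 'UTF-8');
            },
            $node->getValue()
        );
    }
}

$handler = new SketchLdmlHandler();
$value = $handler->_(new SketchNode('2006\u00a0г.'));
```

 Routing every read through one method means the escape handling can be
 extended later without touching the individual call sites.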

 Note: Prado and Symfony use ICU's {{{.dat}}} files and thus are not
 affected. Zend Framework doesn't seem to handle the escape sequences at
 all.

-- 
Ticket URL: <http://trac.agavi.org/ticket/478#comment:1>
Agavi <http://www.agavi.org/>
An MVC Framework for PHP5


_______________________________________________
Agavi Tickets Mailing List
[email protected]
http://lists.agavi.org/mailman/listinfo/tickets
