#478: Character escape sequences in CLDR data are not evaluated, single quotes
possibly broken, too
-------------------------+--------------------------------------------------
Reporter: david | Owner: dominik
Type: defect | Status: new
Priority: high | Milestone: 0.11
Component: translation | Version: 0.11.0RC4
Severity: critical | Resolution:
Keywords: |
-------------------------+--------------------------------------------------
Old description:
> Example:
> source:/tags/0.11.0RC4/src/translation/data/locales/[EMAIL PROTECTED]
>
> Note the {{{\u00a0}}} - that's a non-breaking space
> (http://www.unicode.org/charts/PDF/U0080.pdf), it should be converted.
>
> Reproduce:
> * configure and set locale to {{{ru_RU}}}
> * format a date using {{{long}}} (here {{{date}}} only, for Nov 26,
> 2006).
> * Expected result is: 26 ноября 2006 г.
> * Yes, including the trailing dot
> * The space between "2006" and "г" is a non-breaking space, unicode
> character no. x0A/160 (not ASCII character 160!)
> * Actual result is: 26 ноября 2006\200600AM0г.'
> * the trailing single quote indicates that single quotes are probably
> handled incorrectly, see http://unicode.org/reports/tr35/#Unicode_Sets
> (subsection E.1)
> * the "2006" is from the lowercase "u" in the escape sequence
> * the "AM" is from the lowercase "a" in the escape sequence
>
> Possible solution: grab the escape sequences (maybe best at compile
> time!) and convert them to XML entities, then convert them to UTF-8:
> {{{
> $seq = '00a0';
> html_entity_decode('&#x' . $seq . ';', ENT_QUOTES, 'utf-8');
> }}}
>
> The possible sequences are described in
> http://unicode.org/reports/tr35/#Unicode_Sets section E.2 (I don't think
> we can support {{{\N{name}}}} though
>
> And to make things even nicer:
> Any character formed as the result of a backslash escape loses any
> special meaning and is treated as a literal. In particular, note that \u
> and \U escapes create literal characters. (In contrast, Java treats
> Unicode escapes as just a way to represent arbitrary characters in an
> ASCII source file, and any resulting characters are not tagged as
> literals.)
>
> My guess is that many other locales use escape sequences, not only in
> date (or other) patterns, so it's pretty important to fix this.
>
> I suggest replacing all {{{$node->getValue()}}} calls with
> {{{$this->_($node)}}} or something in AgaviLdmlConfigHandler, where
> {{{_()}}} calls {{{getValue()}}} on the node and then looks for escape
> sequences to eval.
>
> Note: Prado and Symfony use ICU's {{{.dat}}} files and thus are not
> affected. Zend Framework doesn't seem to handle the escape sequences at
> all.
New description:
Example:
source:/tags/0.11.0RC4/src/translation/data/locales/[EMAIL PROTECTED]
Note the {{{\u00a0}}} - that's a non-breaking space
(http://www.unicode.org/charts/PDF/U0080.pdf), it should be converted.
Reproduce:
* configure and set locale to {{{ru_RU}}}
* format a date using {{{long}}} (here {{{date}}} only, for Nov 26,
2006).
* Expected result is: 26 ноября 2006 г.
* Yes, including the trailing dot
* The space between "2006" and "г" is a non-breaking space, Unicode
character no. xA0/160 (not ASCII character 160!)
* Actual result is: 26 ноября 2006\200600AM0г.'
* the trailing single quote indicates that single quotes are probably
handled incorrectly, see http://unicode.org/reports/tr35/#Unicode_Sets
(subsection E.1)
* the "2006" is from the lowercase "u" in the escape sequence
* the "AM" is from the lowercase "a" in the escape sequence
Possible solution: grab the escape sequences (maybe best at compile time!)
and convert them to XML entities, then convert them to UTF-8:
{{{
$seq = '00a0';
// decode the numeric entity into the UTF-8 encoded character
$char = html_entity_decode('&#x' . $seq . ';', ENT_QUOTES, 'UTF-8');
}}}
The possible sequences are described in
http://unicode.org/reports/tr35/#Unicode_Sets section E.2 (I don't think
we can support {{{\N{name}}}} though)
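A minimal sketch of that conversion, applied to a whole string instead of a single sequence. The function name {{{decodeLdmlEscapes}}} is hypothetical and not part of Agavi. It only covers the fixed-width forms {{{\uhhhh}}}, {{{\Uhhhhhhhh}}} and {{{\xhh}}}; anything else, including {{{\N{name}}}}, is left untouched, and an escaped backslash in front of a sequence is not special-cased. It uses a closure, so it assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper (not an Agavi API): expands \uhhhh, \Uhhhhhhhh and \xhh
// escapes into UTF-8 characters by way of numeric character references.
function decodeLdmlEscapes($value)
{
    return preg_replace_callback(
        '/\\\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|x([0-9A-Fa-f]{2}))/',
        function (array $m) {
            // Only one alternative matched; its hex digits are the last element,
            // because trailing unmatched groups are omitted from $m.
            $hex = $m[count($m) - 1];
            return html_entity_decode('&#x' . $hex . ';', ENT_QUOTES, 'UTF-8');
        },
        $value
    );
}

// Illustrative input only, not the actual ru_RU pattern from the data file.
$decoded = decodeLdmlEscapes('yyyy\u00a0G');
```

Note that {{{html_entity_decode()}}} leaves references to most control characters undecoded, so this approach would not work for escapes such as {{{\x01}}}.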
And to make things even nicer, the spec adds:
Any character formed as the result of a backslash escape loses any
special meaning and is treated as a literal. In particular, note that \u
and \U escapes create literal characters. (In contrast, Java treats
Unicode escapes as just a way to represent arbitrary characters in an
ASCII source file, and any resulting characters are not tagged as
literals.)
That's not too much of an issue though, since special meanings should only
occur in format strings. Therefore, we can evaluate the escape sequences,
put them in single quotes, and let the formatters remove these single
quotes again. That should be safe.
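A rough sketch of that quoting idea, under stated assumptions: both function names are hypothetical, only {{{\uhhhh}}} is handled to keep it short, and {{{stripPatternQuotes}}} is a simplified stand-in for what a real formatter does with quoted literals, not Agavi's formatter code.

```php
<?php
// Hypothetical helper: decode \uhhhh escapes and wrap each resulting character
// in single quotes so that a formatter treats it as literal text.
function decodeEscapesAsLiterals($pattern)
{
    return preg_replace_callback(
        '/\\\\u([0-9A-Fa-f]{4})/',
        function (array $m) {
            $char = html_entity_decode('&#x' . $m[1] . ';', ENT_QUOTES, 'UTF-8');
            // A literal single quote is written as two single quotes.
            return $char === "'" ? "''" : "'" . $char . "'";
        },
        $pattern
    );
}

// Simplified stand-in for the formatter side: drop the quote delimiters and
// collapse '' into one literal quote.
function stripPatternQuotes($pattern)
{
    return preg_replace_callback(
        "/''|'((?:[^']|'')*)'/",
        function (array $m) {
            // Group 1 is absent when the bare '' alternative matched.
            return isset($m[1]) ? str_replace("''", "'", $m[1]) : "'";
        },
        $pattern
    );
}

// Illustrative input only, not the actual ru_RU pattern.
$quoted = decodeEscapesAsLiterals('yyyy\u00a0G');
$plain  = stripPatternQuotes($quoted);
```

One case worth checking before relying on this: if an escape sits directly next to text that is already quoted, the closing and opening quotes end up adjacent, and two adjacent single quotes read as an escaped literal quote. Merging the decoded character into the neighbouring quoted run would probably be needed there.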
My guess is that many other locales use escape sequences, not only in date
(or other) patterns, so it's pretty important to fix this.
I suggest replacing all {{{$node->getValue()}}} calls in
AgaviLdmlConfigHandler with {{{$this->_($node)}}} or something similar,
where {{{_()}}} calls {{{getValue()}}} on the node and then expands any
escape sequences it finds.
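Roughly what such a wrapper could look like. {{{SketchNode}}} and {{{SketchLdmlHandler}}} are stand-ins invented for this example; the real AgaviLdmlConfigHandler and its node class are not reproduced here, and the only thing assumed about a node is that it has {{{getValue()}}}. Again limited to {{{\uhhhh}}} and assuming PHP 5.3 or later for the closure.

```php
<?php
// Stand-in for a config value node; only getValue() is assumed.
class SketchNode
{
    private $value;

    public function __construct($value)
    {
        $this->value = $value;
    }

    public function getValue()
    {
        return $this->value;
    }
}

// Hypothetical handler showing the proposed _() wrapper.
class SketchLdmlHandler
{
    // Read the node value, then expand \uhhhh escapes into UTF-8 characters.
    public function _($node)
    {
        return preg_replace_callback(
            '/\\\\u([0-9A-Fa-f]{4})/',
            function (array $m) {
                return html_entity_decode('&#x' . $m[1] . ';', ENT_QUOTES, 'UTF-8');
            },
            $node->getValue()
        );
    }
}

$handler = new SketchLdmlHandler();
// Illustrative value only, not taken from the locale data.
$pattern = $handler->_(new SketchNode('yyyy\u00a0G'));
```

Funnelling every read through one method keeps the escape handling in a single place, so supporting further escape forms later would not mean touching each call site again.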
Note: Prado and Symfony use ICU's {{{.dat}}} files and thus are not
affected. Zend Framework doesn't seem to handle the escape sequences at
all.
--
Ticket URL: <http://trac.agavi.org/ticket/478#comment:1>
Agavi <http://www.agavi.org/>
An MVC Framework for PHP5
_______________________________________________
Agavi Tickets Mailing List
[email protected]
http://lists.agavi.org/mailman/listinfo/tickets