Re: Whitespace in \c[...], \x[...], etc.

2009-04-28 Thread Larry Wall
On Mon, Apr 27, 2009 at 11:04:03AM +0200, Helmut Wollmersdorfer wrote:
 It's not explicitly specified whether insignificant whitespace is allowed in
 \c[...], \x[...], etc.

 Std.pm allows e.g.

   \x[   41  ,   42  ,  43  ]

 For convenience - especially with long charnames - it should be possible  
 to write

 \c[
 SPACE, # blafasel
 LATIN SMALL LETTER A,  # some comment
 COMBINING DOT BELOW,   # thisandthat
 ]

Does anyone know offhand whether the Unicode Consortium has an explicit
policy against use of punctuation in a charname?  So far they only
seem to use hyphen and parens, but I wonder to what extent we can
depend on that...

In any case, STD doesn't currently try to check the string in \c[...]
for correctness.  It just scans for the closing bracket.  We will
certainly need to refine this, and the suggested approach is certainly
a possible outcome, if we decide it's sufficiently unambiguous.

Larry


Re: Whitespace in \c[...], \x[...], etc.

2009-04-28 Thread Mark J. Reed
On Tue, Apr 28, 2009 at 10:22 AM, Larry Wall la...@wall.org wrote:
 Does anyone know offhand whether the Unicode Consortium has an explicit
 policy against use of punctuation in a charname?  So far they only
 seem to use hyphen and parens, but I wonder to what extent we can
 depend on that...


According to the 5.0.0 standard, section 4.8:

Unicode character names contain only uppercase Latin letters A
through Z, digits, space, and hyphen-minus.

So it seems the notes in parentheses are not considered part of the char name.

-- 
Mark J. Reed markjr...@gmail.com
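
The section 4.8 rule quoted above can be checked mechanically. The following is a minimal Python sketch, not anything from STD or Rakudo: it walks a range of code points and reports any name that falls outside the stated repertoire. The helper name names_outside_repertoire is made up for illustration, and Python's unicodedata module reflects whatever Unicode version ships with the interpreter, which will be newer than 5.0.0.

```python
import re
import unicodedata

# Repertoire quoted from section 4.8: uppercase A-Z, digits, space, hyphen-minus.
NAME_RE = re.compile(r"[A-Z0-9 -]+")

def names_outside_repertoire(start=0, stop=0x110000):
    """Return (code point, name) pairs whose name breaks the quoted rule.

    Hypothetical helper for illustration only. Code points without a
    name in Python's database (controls, unassigned, surrogates) are skipped.
    """
    offenders = []
    for cp in range(start, stop):
        name = unicodedata.name(chr(cp), None)
        if name is not None and not NAME_RE.fullmatch(name):
            offenders.append((cp, name))
    return offenders

if __name__ == "__main__":
    # Checking a modest range keeps the run short; widen it as needed.
    print(names_outside_repertoire(0, 0x3000))
```

Note that unicodedata.name() has no name at all for U+000A, so a check like this says nothing about the parenthesized labels attached to control characters.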


Re: Whitespace in \c[...], \x[...], etc.

2009-04-28 Thread Patrick R. Michaud
On Tue, Apr 28, 2009 at 01:28:40PM -0400, Mark J. Reed wrote:
 On Tue, Apr 28, 2009 at 10:22 AM, Larry Wall la...@wall.org wrote:
  Does anyone know offhand whether the Unicode Consortium has an explicit
  policy against use of punctuation in a charname?  So far they only
  seem to use hyphen and parens, but I wonder to what extent we can
  depend on that...
 
 According to the 5.0.0 standard, section 4.8:
 
 Unicode character names contain only uppercase Latin letters A
 through Z, digits, space, and hyphen-minus.
 
 So it seems the notes in parentheses are not considered part of the char name.

Countering this, though:

* The XML schema for the Unicode Character Database in XML [1] 
  seems to allow parens in the character name property:

character-name = xsd:string { pattern=([A-Z0-9 #\-\(\)]*)|(<control>) } 

* The Unicode character name database [2] has parens in the
  name property field for many characters

000A;<control>;Cc;0;B;N;LINE FEED (LF)

* ICU doesn't seem to recognize the versions of the name without
  the parens (or if it does, I haven't been able to figure out the
  correct incantations to make it do so).

Of course, it's very possible that I'm misreading the Unicode
specifications, and the note that Mark cites would seem to be
very explicit.  But thus far in playing with this I've seen
more indications that the parens are allowed or even required
than I've seen that indicate they're excluded.

Pm

[1] http://www.unicode.org/reports/tr42/tr42-3.html#N66310
[2] http://unicode.org/Public/UNIDATA/UnicodeData.txt


Re: Whitespace in \c[...], \x[...], etc.

2009-04-28 Thread Patrick R. Michaud
On Tue, Apr 28, 2009 at 07:22:18AM -0700, Larry Wall wrote:
 On Mon, Apr 27, 2009 at 11:04:03AM +0200, Helmut Wollmersdorfer wrote:
  Std.pm allows e.g.
 
    \x[   41  ,   42  ,  43  ]
 
  For convenience - especially with long charnames - it should be possible  
  to write
 
  \c[
  SPACE, # blafasel
  LATIN SMALL LETTER A,  # some comment
  COMBINING DOT BELOW,   # thisandthat
  ]
 
 In any case, STD doesn't currently try to check the string in \c[...]
 for correctness.  It just scans for the closing bracket.  We will
 certainly need to refine this, and the suggested approach is certainly
 a possible outcome, if we decide it's sufficiently unambiguous.

FWIW, Rakudo and PGE now allow spaces inside the brackets, although they 
don't understand the # ... comments yet.

Pm
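
To make the proposed bracket syntax concrete, here is a rough Python sketch of how the body of \x[...] and \c[...] might be split if whitespace and # comments were both allowed. It is not how STD, Rakudo, or PGE parse these escapes; the function names are invented for illustration, and the comment stripping is only safe on the assumption that neither ',' nor '#' can occur inside a character name.

```python
import unicodedata

def _strip_comments(body: str) -> str:
    # Drop everything from '#' to end of line (assumes names never contain '#').
    return "\n".join(line.split("#", 1)[0] for line in body.splitlines())

def parse_hex_list(body: str) -> str:
    """Hypothetical helper: turn the inside of \\x[...] into a string."""
    items = (item.strip() for item in _strip_comments(body).split(","))
    return "".join(chr(int(item, 16)) for item in items if item)

def parse_charname_list(body: str) -> str:
    """Hypothetical helper: turn the inside of \\c[...] into a string."""
    chars = []
    for item in _strip_comments(body).split(","):
        name = " ".join(item.split())  # collapse runs of whitespace
        if name:                       # tolerate a trailing comma
            chars.append(unicodedata.lookup(name))
    return "".join(chars)

if __name__ == "__main__":
    print(parse_hex_list("   41  ,   42  ,  43  "))
    print(repr(parse_charname_list("""
        SPACE, # blafasel
        LATIN SMALL LETTER A,  # some comment
        COMBINING DOT BELOW,   # thisandthat
    """)))
```

Splitting on commas before looking at the names keeps the sketch short, but it is exactly the step that depends on the character-name repertoire being closed.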


Re: Whitespace in \c[...], \x[...], etc.

2009-04-28 Thread Mark J. Reed
On Tue, Apr 28, 2009 at 2:27 PM, Patrick R. Michaud pmich...@pobox.com wrote:
 According to the 5.0.0 standard, section 4.8:

 Unicode character names contain only uppercase Latin letters A
 through Z, digits, space, and hyphen-minus.

 So it seems the notes in parentheses are not considered part of the char 
 name.

 Countering this, though:

 * The XML schema for the Unicode Character Database in XML [1]
  seems to allow parens in the character name property:

    character-name = xsd:string { pattern=([A-Z0-9 #\-\(\)]*)|(<control>) }

Also '#', though I see no character names containing that symbol.

But all the parentheses I see in the list of character names are
surrounding lowercase letters, which are explicitly disallowed not
only in the spec I quoted, but in the XML schema definition you quote
above.  e.g.

00C6 LATIN CAPITAL LETTER AE (ash)

 * The Unicode character name database [2] has parens in the
  name property field for many characters

    000A;<control>;Cc;0;B;N;LINE FEED (LF)

That's not the name property field.  The Unicode character name is
field 1 (<control>, in this case).  The field whose value is LINE
FEED (LF) is the Unicode_1_Name field, which for control characters
supplies the ISO 6429 name.

-- 
Mark J. Reed markjr...@gmail.com
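
The field numbering is easier to see with a record split out. This is a small Python sketch, assuming the full 15-field form of the U+000A record in UnicodeData.txt (empty fields included); the field labels and the function name are my own choices for illustration, not identifiers from any Unicode tool.

```python
# Labels for the 15 semicolon-separated fields, counting from 0.
# The labels are illustrative; the positions are what matter here.
FIELD_NAMES = [
    "Code", "Name", "General_Category", "Canonical_Combining_Class",
    "Bidi_Class", "Decomposition", "Decimal", "Digit", "Numeric",
    "Bidi_Mirrored", "Unicode_1_Name", "ISO_Comment",
    "Simple_Uppercase_Mapping", "Simple_Lowercase_Mapping",
    "Simple_Titlecase_Mapping",
]

def parse_unicodedata_line(line: str) -> dict:
    """Hypothetical helper: map one UnicodeData.txt record to named fields."""
    values = line.rstrip("\n").split(";")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    return dict(zip(FIELD_NAMES, values))

# Assumed full form of the record discussed in the thread.
record = parse_unicodedata_line("000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;")

if __name__ == "__main__":
    print(record["Name"])            # field 1
    print(record["Unicode_1_Name"])  # field 10
```

Field 1 holds only the <control> placeholder, while the parenthesized label sits in field 10.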


Re: Whitespace in \c[...], \x[...], etc.

2009-04-28 Thread Patrick R. Michaud
On Tue, Apr 28, 2009 at 03:08:05PM -0400, Mark J. Reed wrote:
 On Tue, Apr 28, 2009 at 2:27 PM, Patrick R. Michaud pmich...@pobox.com 
 wrote:
  * The Unicode character name database [2] has parens in the
   name property field for many characters
 
     000A;<control>;Cc;0;B;N;LINE FEED (LF)
 
 That's not the name property field.  The Unicode character name is
 field 1 (<control>, in this case).  The field whose value is LINE
 FEED (LF) is the Unicode_1_Name field, which for control characters
 supplies the ISO 6429 name.

Ah, thanks for the excellent clarification.

Returning to the original question:  Would this then mean
that we don't provide a way to specify U+000A and other control
characters using a name inside of \c[...]?

Or (more likely) does it mean that the names we accept inside
of the \c[...] are more than just the strict
Unicode character names listed above--i.e., the Unicode_1_Name
field and other related aliases (whatever those might be)?

Pm
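
One way to picture the second option is a lookup that tries the strict character name first and then falls back to a table of other labels. The Python below is only a sketch of that idea, not a proposal for what Perl 6 should accept: the function name resolve_charname and the three-entry fallback table are hypothetical, with the entries taken from the Unicode_1_Name field for those control characters.

```python
import unicodedata

# Hypothetical fallback table keyed by Unicode_1_Name-style labels.
# A real implementation would presumably be generated from the database.
LEGACY_NAMES = {
    "LINE FEED (LF)": "\u000A",
    "FORM FEED (FF)": "\u000C",
    "CARRIAGE RETURN (CR)": "\u000D",
}

def resolve_charname(name: str) -> str:
    """Hypothetical helper: resolve a name as \\c[...] might.

    Tries Python's own name lookup first, then the legacy-label table.
    """
    normalized = " ".join(name.split()).upper()
    try:
        return unicodedata.lookup(normalized)
    except KeyError:
        pass
    try:
        return LEGACY_NAMES[normalized]
    except KeyError:
        raise KeyError(f"unknown character name: {name!r}") from None

if __name__ == "__main__":
    print(repr(resolve_charname("LATIN SMALL LETTER A")))
    print(repr(resolve_charname("LINE FEED (LF)")))
```

Trying the strict name first means the fallback can never shadow a real character name, at the cost of admitting parentheses into what the escape has to scan.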