Re: Whitespace in \c[...], \x[...], etc.
On Mon, Apr 27, 2009 at 11:04:03AM +0200, Helmut Wollmersdorfer wrote:
> It's not explicitly specified whether insignificant whitespace is
> allowed in \c[...], \x[...], etc.
>
> Std.pm allows e.g.
>
>     \x[ 41 , 42 , 43 ]
>
> For convenience - especially with long charnames - it should be
> possible to write
>
>     \c[
>         SPACE,                  # blafasel
>         LATIN SMALL LETTER A,   # some comment
>         COMBINING DOT BELOW,    # thisandthat
>     ]

Does anyone know offhand whether the Unicode Consortium has an explicit
policy against use of punctuation in a charname?  So far they only seem
to use hyphen and parens, but I wonder to what extent we can depend on
that...

In any case, STD doesn't currently try to check the string in \c[...]
for correctness.  It just scans for the closing bracket.  We will
certainly need to refine this, and the suggested approach is certainly
a possible outcome, if we decide it's sufficiently unambiguous.

Larry
Re: Whitespace in \c[...], \x[...], etc.
On Tue, Apr 28, 2009 at 10:22 AM, Larry Wall la...@wall.org wrote:
> Does anyone know offhand whether the Unicode Consortium has an explicit
> policy against use of punctuation in a charname?  So far they only seem
> to use hyphen and parens, but I wonder to what extent we can depend on
> that...

According to the 5.0.0 standard, section 4.8:

    Unicode character names contain only uppercase Latin letters A
    through Z, digits, space, and hyphen-minus.

So it seems the notes in parentheses are not considered part of the
char name.

--
Mark J. Reed markjr...@gmail.com
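
The repertoire quoted from section 4.8 is small enough to check mechanically. The following is only a sketch, written in Python for illustration rather than in Perl 6; the names `NAME_RE` and `is_plausible_charname` are made up here, and the check covers the character repertoire only, not whether the name actually exists.

```python
import re

# Repertoire taken from the rule quoted above: uppercase A-Z, digits,
# space, and hyphen-minus. Nothing else is checked (assumption: we only
# care about which characters appear, not whether the name is assigned).
NAME_RE = re.compile(r"[A-Z0-9 -]+")


def is_plausible_charname(text):
    """Return True if text uses only characters the quoted rule permits."""
    return NAME_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("LATIN SMALL LETTER A", "HYPHEN-MINUS", "LINE FEED (LF)"):
        print(candidate, is_plausible_charname(candidate))
```

Under this rule a string containing parentheses, such as LINE FEED (LF), is rejected, which is the point at issue in the thread.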
Re: Whitespace in \c[...], \x[...], etc.
On Tue, Apr 28, 2009 at 01:28:40PM -0400, Mark J. Reed wrote:
> On Tue, Apr 28, 2009 at 10:22 AM, Larry Wall la...@wall.org wrote:
> > Does anyone know offhand whether the Unicode Consortium has an
> > explicit policy against use of punctuation in a charname?  So far
> > they only seem to use hyphen and parens, but I wonder to what
> > extent we can depend on that...
>
> According to the 5.0.0 standard, section 4.8:
>
>     Unicode character names contain only uppercase Latin letters A
>     through Z, digits, space, and hyphen-minus.
>
> So it seems the notes in parentheses are not considered part of the
> char name.

Countering this, though:

* The XML schema for the Unicode Character Database in XML [1]
  seems to allow parens in the character name property:

      character-name = xsd:string { pattern=([A-Z0-9 #\-\(\)]*)|(control) }

* The Unicode character name database [2] has parens in the
  name property field for many characters:

      000A;control;Cc;0;B;N;LINE FEED (LF)

* ICU doesn't seem to recognize the versions of the name without
  the parens (or if it does, I haven't been able to figure out the
  correct incantations to make it do so).

Of course, it's very possible that I'm misreading the Unicode
specifications, and the note that Mark cites would seem to be very
explicit.  But thus far in playing with this I've seen more
indications that the parens are allowed or even required than I've
seen that indicate they're excluded.

Pm

[1] http://www.unicode.org/reports/tr42/tr42-3.html#N66310
[2] http://unicode.org/Public/UNIDATA/UnicodeData.txt
Re: Whitespace in \c[...], \x[...], etc.
On Tue, Apr 28, 2009 at 07:22:18AM -0700, Larry Wall wrote:
> On Mon, Apr 27, 2009 at 11:04:03AM +0200, Helmut Wollmersdorfer wrote:
> > Std.pm allows e.g.
> >
> >     \x[ 41 , 42 , 43 ]
> >
> > For convenience - especially with long charnames - it should be
> > possible to write
> >
> >     \c[
> >         SPACE,                  # blafasel
> >         LATIN SMALL LETTER A,   # some comment
> >         COMBINING DOT BELOW,    # thisandthat
> >     ]
>
> In any case, STD doesn't currently try to check the string in \c[...]
> for correctness.  It just scans for the closing bracket.  We will
> certainly need to refine this, and the suggested approach is certainly
> a possible outcome, if we decide it's sufficiently unambiguous.

FWIW, Rakudo and PGE now allow spaces inside the brackets, although
they don't understand the # ... comments yet.

Pm
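
To make the proposed syntax concrete, here is a rough sketch of how the body of a \c[...] could be split into names if both whitespace and # comments were treated as insignificant. It is written in Python for illustration only and is not how STD, Rakudo, or PGE do it; `parse_charname_list` and `expand_charnames` are hypothetical names. It assumes that '#' and ',' never occur inside a character name, which holds under the section 4.8 rule but is exactly the kind of ambiguity the thread is worried about.

```python
import unicodedata


def parse_charname_list(body):
    """Split the text between the brackets of \\c[...] into names.

    Assumptions (not settled by the thread): '#' starts a comment that
    runs to end of line, ',' separates names, and surrounding or
    repeated whitespace is insignificant.
    """
    names = []
    for line in body.splitlines():
        code = line.split("#", 1)[0]          # drop a trailing comment
        for item in code.split(","):
            name = " ".join(item.split())     # trim and collapse spaces
            if name:                          # skip empty items
                names.append(name)
    return names


def expand_charnames(body):
    """Look up each parsed name and join the resulting characters."""
    return "".join(unicodedata.lookup(n) for n in parse_charname_list(body))


if __name__ == "__main__":
    body = """
        SPACE,                  # blafasel
        LATIN SMALL LETTER A,   # some comment
        COMBINING DOT BELOW,    # thisandthat
    """
    print(parse_charname_list(body))
```

A trailing comma, as in the example from the original post, simply produces an empty item that is skipped.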
Re: Whitespace in \c[...], \x[...], etc.
On Tue, Apr 28, 2009 at 2:27 PM, Patrick R. Michaud pmich...@pobox.com wrote:
> > According to the 5.0.0 standard, section 4.8:
> >
> >     Unicode character names contain only uppercase Latin letters A
> >     through Z, digits, space, and hyphen-minus.
> >
> > So it seems the notes in parentheses are not considered part of the
> > char name.
>
> Countering this, though:
>
> * The XML schema for the Unicode Character Database in XML [1]
>   seems to allow parens in the character name property:
>
>       character-name = xsd:string { pattern=([A-Z0-9 #\-\(\)]*)|(control) }

Also '#', though I see no character names containing that symbol.  But
all the parentheses I see in the list of character names are
surrounding lowercase letters, which are explicitly disallowed not only
in the spec I quoted, but in the XML schema definition you quote above.
e.g.

    00C6 LATIN CAPITAL LETTER AE (ash)

> * The Unicode character name database [2] has parens in the
>   name property field for many characters:
>
>       000A;control;Cc;0;B;N;LINE FEED (LF)

That's not the name property field.  The Unicode character name is
field 1 (control, in this case).  The field whose value is
LINE FEED (LF) is the Unicode_1_Name field, which for control
characters supplies the ISO 6429 name.

--
Mark J. Reed markjr...@gmail.com
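
The field distinction is easier to see when a record is split on its semicolons. The sketch below is Python, for illustration only. The sample record is the U+000A line as I understand it to appear in the full UnicodeData.txt, with its empty fields and the angle brackets around control restored; treat that line, and the field positions 1 (Name) and 10 (Unicode_1_Name), as assumptions to verify against [2]. `name_fields` is a made-up helper.

```python
# Assumed full form of the U+000A record (the thread shows it abridged).
SAMPLE_RECORD = "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"

# Zero-based field positions, as assumed for UnicodeData.txt.
NAME_FIELD = 1
UNICODE_1_NAME_FIELD = 10


def name_fields(record):
    """Return the code point, Name, and Unicode_1_Name of one record."""
    fields = record.split(";")
    return {
        "code": int(fields[0], 16),
        "name": fields[NAME_FIELD],
        "unicode_1_name": fields[UNICODE_1_NAME_FIELD],
    }


if __name__ == "__main__":
    print(name_fields(SAMPLE_RECORD))
```

On this reading, the parenthesized text sits in the Unicode_1_Name field, while the Name field holds only the control placeholder.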
Re: Whitespace in \c[...], \x[...], etc.
On Tue, Apr 28, 2009 at 03:08:05PM -0400, Mark J. Reed wrote:
> On Tue, Apr 28, 2009 at 2:27 PM, Patrick R. Michaud pmich...@pobox.com wrote:
> > * The Unicode character name database [2] has parens in the
> >   name property field for many characters:
> >
> >       000A;control;Cc;0;B;N;LINE FEED (LF)
>
> That's not the name property field.  The Unicode character name is
> field 1 (control, in this case).  The field whose value is
> LINE FEED (LF) is the Unicode_1_Name field, which for control
> characters supplies the ISO 6429 name.

Ah, thanks for the excellent clarification.

Returning to the original question: Would this then mean that we don't
provide a way to specify U+000A and other control characters using a
name inside of \c[...]?

Or (more likely) does it mean that the names we accept inside of the
\c[...] are more than just the strict Unicode character names listed
above--i.e., the Unicode_1_Name field and other related aliases
(whatever those might be)?

Pm
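
One way to picture the second option is a lookup that tries the strict character name first and only then consults a table of additional accepted names. This is a sketch in Python for illustration, not a proposal for how \c[...] should behave; `EXTRA_NAMES` and `resolve_charname` are hypothetical, and the table entries are examples only, since the thread leaves open which aliases would be accepted.

```python
import unicodedata

# Hypothetical table of extra accepted names. Which names belong here
# (Unicode_1_Name values, other aliases) is the open question above.
EXTRA_NAMES = {
    "LINE FEED (LF)": "\n",
}


def resolve_charname(name):
    """Resolve a name, preferring the library's own name lookup.

    Falls back to EXTRA_NAMES when the library does not know the name;
    re-raises KeyError if neither source knows it.
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        if name in EXTRA_NAMES:
            return EXTRA_NAMES[name]
        raise


if __name__ == "__main__":
    print(repr(resolve_charname("LATIN SMALL LETTER A")))
    print(repr(resolve_charname("LINE FEED (LF)")))
```

Trying the strict name first means an extra name can never shadow a real character name, at the cost of a second lookup for control characters.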