RE: Terms for rotations

2014-11-10 Thread Whistler, Ken
> WIDDERSHINS is shorter than > COUNTERCLOCKWISE, but is not exactly a common term, especially in > technical English. Aye, but laddie, then we'd have to use DEASIL for CLOCKWISE! And we'd have wiccans after us to spell it "DEOSIL" instead. ;-) --Ken

RE: Terms for rotations

2014-11-10 Thread Whistler, Ken
> Look at this picture: > http://www.permisecole.com/code-route/priorites/faux-carrefour-a-sens-giratoire.jpg > Imagine you sit in this car and you want to turn RIGHT. What will you > do? Will you turn the driving wheel clockwise or counterclockwise? And now imagine that you are motoring in a 1904

RE: Terms for rotations

2014-11-10 Thread Whistler, Ken
> > http://www.unicode.org/L2/L2012/12321-n4342-signwriting.pdf > > > > That should give you some ideas about possible alternative approaches > > for the material you are dealing with. > > > > --Ken > > Could the characters SWR2 to SWR8 be applied to chess symbols or should > new rotation modifier

RE: Terms for rotations

2014-11-07 Thread Whistler, Ken
Garth Wallace asked: > I'm currently working towards a proposal to encode a set of symbols > used in fairy chess and chess variants, and I have a question about > naming conventions. Several of the symbols are rotations of already > encoded symbols. ... > > It's even more unclear when it comes to

RE: fonts for U7.0 scripts

2014-10-24 Thread Whistler, Ken
Tom Gewecke wondered: > it seems that you would > need permission to copy the glyph. I wonder if that is necessary. To follow on from Peter Constable's response, it comes down to the actual scenario at hand and precisely what one means by "copy the glyph". Scenario 1 I want to use an example

RE: Code charts and code points

2014-10-24 Thread Whistler, Ken
> I think it is imaginable that someone wants to copy a block of > characters from the code charts, as a handy way of getting them for > inspection, e.g. for testing how some particular software renders them > using some particular font(s). I would expect some confusion then if you > had partly got

RE: Question about a Normalization test

2014-10-23 Thread Whistler, Ken
Aaron Cannon asked: > Hi all, from the latest version of the standard, on line 16977 of the > normalization tests, I am a bit confused by the NFC form. It appears > incorrect to me. Here's the line, sans comment: > > 0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE >
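
The quoted test line can be checked directly against a normalizer. The following is a minimal Python sketch, not part of the thread, and it assumes the interpreter's bundled unicodedata tables agree with the standard for these six code points. It prints the canonical combining class of each character and then compares the NFC result with the second column of the quoted line.

```python
import unicodedata

# Column 1 of the quoted test line (the source string).
source = "\u0061\u0305\u0315\u0300\u05AE\u0062"
# Column 2 of the quoted test line (the NFC form the test expects).
expected_nfc = "\u0061\u05AE\u0305\u0300\u0315\u0062"

# Show the canonical combining class (ccc) of every character.
for ch in source:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")

nfc = unicodedata.normalize("NFC", source)
print(" ".join(f"{ord(ch):04X}" for ch in nfc))
print(nfc == expected_nfc)
```

After canonical reordering, U+0305 sits between U+0061 and U+0300 and has the same combining class as U+0300, so U+0300 cannot combine with the base letter to form U+00E0.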

RE: Limits in UBA

2014-10-22 Thread Whistler, Ken
Eli, I think you are correct that the BidiCharacterTest.txt data currently does not go beyond 3 nesting levels for testing the BPA part of UBA. I agree with Andrew that that is a reasonable guide to the normal limit of meaningful bracket embeddings one might find in text. However, I don't think it

RE: Limits in UBA

2014-10-22 Thread Whistler, Ken
Eli, > > Embeddings are common in generated text. The guiding principle, is > seemingly, when in doubt wrap the string in an embedding. At the UTC, we > heard, that this can lead to very deep stacks - but I've never actually seen > one with more than 63 levels - but that is not my topic here. > >

RE: Bidi Parenthesis Algorithm and BidiCharacterTest.txt

2014-10-15 Thread Whistler, Ken
> > I disagree that this makes N0 a "recursive" rule. It is a rule with > > repeatedly > > applicable subparts. And like nearly all the rules in the UBA (except ones > > which explicitly state that they apply to *original* Bidi_Class values, > > which thus have to be stored across the life of the

RE: Bidi Parenthesis Algorithm and BidiCharacterTest.txt

2014-10-14 Thread Whistler, Ken
Eli asked in response to Andrew: > > · Since 2-17 is now R and not neutral, the resolution of 3-9 is R because > > the > > check for context finds the opening parenthesis at 2 (now R) before the a > at 1. > > Therefore 2-17 is R under N0c2. > > But there's nothing about this in the UAX#9 l

RE: What happened to...?

2014-09-19 Thread Whistler, Ken
Michael, > "Declines to take action" is pretty thin. A proposal which is declined by the UTC doesn't automatically create an obligation to write an extended dissertation explaining the rationale and putting that rationale on record. It might be one thing if there were a lot of controversy involve

RE: Request for Information

2014-07-24 Thread Whistler, Ken
Fantasai asked: > I would like to request that Unicode include, for each writing system it > encodes, some information on how it might justify. > Following up on the comment and examples provided by Richard Wordingham, I'd like to emphasize a relevant point: Scripts may be used for *multiple*

RE: Noto adds CJK, plus new user-facing website

2014-07-16 Thread Whistler, Ken
Andrew, Everybody recognizes the potential risks of getting out too far over one's skis in implementations, but this particular one seems a relatively small risk. Seldom (if ever?) has a NB objected in ballot to these small repertoire additions that have periodically been tacked on at the end of t

RE: Apparent discrepancy between FAQ and Age.txt

2014-06-10 Thread Whistler, Ken
Karl Williamson noted: > The FAQ http://www.unicode.org/faq/private_use.html#sentinels > says that the last 2 code points on the planes except BMP were made > noncharacters in TUS 3.1. DerivedAge.txt gives 2.0 for these. > The *concept* of noncharacter was not invented until Unicode 3.1, so it
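
For orientation, the code points under discussion can be enumerated mechanically. This is a minimal Python sketch with an illustrative function name that does not come from the thread. It lists the noncharacters as the standard currently defines them and says nothing about which version first designated them.

```python
def noncharacters():
    """Return the sorted list of Unicode noncharacter code points."""
    # The contiguous run of 32 noncharacters in the BMP: U+FDD0..U+FDEF.
    points = list(range(0xFDD0, 0xFDF0))
    # The last two code points of each of the 17 planes (0 through 16).
    for plane in range(17):
        base = plane << 16
        points.extend((base | 0xFFFE, base | 0xFFFF))
    return sorted(points)


NONCHARS = noncharacters()
print(len(NONCHARS))
print([f"U+{cp:04X}" for cp in NONCHARS if cp > 0xFFFF][:4])
```

The count is 32 plus 2 per plane, which is the figure of 66 usually quoted for noncharacters.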

RE: Swift

2014-06-05 Thread Whistler, Ken
Hmmm. Any programming language project that derives from someone who describes himself as a “polyhistor”, which claims to be polymorphic and pasigraphic and multi-lingual and orthogonal and polysynthetic, which draws its inspiration from the theory of “Natural Language Metasemantics”, and which n

RE: UTF-16 Encoding Scheme and U+FFFE

2014-06-03 Thread Whistler, Ken
You cannot even be "very confident" of not finding actual ill-formed UTF-16, like unpaired surrogates, in an external file, let alone noncharacters. As for the noncharacters, take a look at the collation test files that we distribute with each version of UCA. The test data includes test strings li

Block Boundaries (was: RE: Corrigendum #9)

2014-05-30 Thread Whistler, Ken
Skipping over the wording related to noncharacters for the moment, let me address the block stability issue: > I also am curious as to why the consecutive group of 32 noncharacters > can't be split off into its own block instead of being part of an Arabic > one. I'm unaware of any stability polic

RE: Unicode Sets in 'Unicode Regular Expressions'

2014-05-27 Thread Whistler, Ken
http://userguide.icu-project.org/strings/unicodeset Whenever UTS #18 talks of "Unicode sets", it means whatever is actually defined in the class UnicodeSet in ICU. --Ken > UTS#18 'Unicode Regular Expressions' Version 17 Requirement RL1.3 > 'Subtraction and Intersection' talks of Unicode sets.

RE: Indic Syllabic Categories

2014-05-12 Thread Whistler, Ken
Richard Wordingham asked: > Is the provisional property 'Indic_Syllabic_Category' defined by > anything deeper than the UCD file IndicSyllabicCategory itself? Basically, no. It simply gathers together information scattered about in the core spec and elsewhere about claims regarding what all the

Bidi Brackets for Dummies

2014-04-24 Thread Whistler, Ken
Given the incredible level of interest shown on this list during the last week, I am glad that I can finally announce the publication of Bidi Brackets for Dummies: http://www.unicode.org/notes/tr39/ I had wanted to publish that several weeks ago, but unfortunately, publication was held up for mor

RE: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?

2014-04-24 Thread Whistler, Ken
> On 23 Apr 2014, at 22:16, Mathias Bynens wrote: > > > Let’s say I’m writing a program that strips combining characters and > grapheme extenders from an input string. > > > > For combining marks, I’m looking for any non-combining marks (e.g. `a`) > followed by one or more combining marks (e.g. `
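
The stripping task described in the question can be approximated with the Python standard library. This is a rough sketch under stated assumptions: it filters on General_Category (Mn, Mc, Me) only, because the standard library does not expose Grapheme_Extend, and that property also covers a few characters outside the mark categories. The function name is illustrative.

```python
import unicodedata


def strip_marks(text):
    """Drop every code point whose General_Category is Mn, Mc, or Me."""
    # Decompose first so precomposed letters such as U+00E9 expose their marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed
            if not unicodedata.category(ch).startswith("M")]
    # Recompose whatever is left so the result is in NFC again.
    return unicodedata.normalize("NFC", "".join(kept))


print(strip_marks("e\u0301 a\u0300\u0323"))
```

A filter of this kind removes marks regardless of what precedes them, so it does not answer the question of which base a given extender applies to.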

RE: ID_Start, ID_Continue, and stability extensions

2014-04-23 Thread Whistler, Ken
Mathias, > > What are the “stability extensions” this document refers to? > > > Here are the code points that match the respective property according to > `DerivedCoreProperties.txt`, yet don’t match these properties if you’re > adding/removing the categories manually based on the property definit

RE: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Whistler, Ken
Ilya noted: > [Below, I completely ignore BIDI part of the specification, and >concentrate ONLY on the parens match. I do not understand why this >question is interlaced with BIDI determination; I trust that it is.] Actually, it is, because the bracket-matching is really only int

RE: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector?

2014-04-02 Thread Whistler, Ken
Yucca noted: > > These glyphic pieces of symbols are only relevant and useful > > in the context of mathematical typesetting programs like TeX. > > I’m not sure whether TeX uses such characters at all. TeX is oriented > towards typesetting glyphs, often not caring that much about abstract > chara

RE: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector?

2014-04-02 Thread Whistler, Ken
Ilya, U+23AF is *definitely* not a variation selector at all. It is part of a set of bracket pieces (and other graphic pieces) in the range U+239B..U+23B1. See discussion of the topic at: http://www.unicode.org/forum/viewtopic.php?f=35&t=206 See also Section 2.13 of UTR #25: http://www.unicod

RE: Bidi reordering of soft hyphen

2014-04-01 Thread Whistler, Ken
> Is it legitimate to truncate the context to a single line? The BiDi > algorithm is attempting to interpret unlabelled text as embedded text > (it's not an arbitrary dance), and in just one line there is no > indicator of whether the hyphen is part of the LTR text embedded in RTL > text.

RE: Bidi reordering of soft hyphen

2014-04-01 Thread Whistler, Ken
Richard Wordingham noted: > As U+2010 HYPHEN would result in text like 'car-', in an English > influenced context I would also go with 'car-'. That's always a possibility, I suppose, but I'm not sure what "English influenced context" means here. The examples I just gave were for a RTL pa

RE: Bidi reordering of soft hyphen

2014-04-01 Thread Whistler, Ken
I don’t think the answer is directly deduced from UAX #9, because it involves deciding where to insert a visible hyphen for display. However, I think the correct answer here is your number two guess, i.e. (in a RTL paragraph context): -car SI TORRAC A way to think about this, rather than starting

RE: Editing Sinhala and Similar Scripts

2014-03-19 Thread Whistler, Ken
And I think you need to distinguish between *proximate* behavior in an editor and editing behavior in general. Once a user enters editing mode, the expectation that we (the software community writing text editors) have built, in interaction with users, is that within reason, something that you ha

RE: Romanized Singhala got great reception in Sri Lanka

2014-03-17 Thread Whistler, Ken
Well, I actually don’t see. I took a look at the Sinhala you inserted in this email. I cannot tell what you did at your input end (about “inserted all joiners”), but there are no actual joiners in the text itself. It displayed just fine in my email (including the correct conditional formatting of

RE: Names for control characters

2014-03-12 Thread Whistler, Ken
Per continued: > I know it's not a name. My question was *why* control characters don't > *have* names like > > CONTROL CHARACTER NULL > CONTROL CHARACTER START OF HEADING > CONTROL CHARACTER START OF TEXT > etc. > > It would be so obvious to have it like that, so I assume there is some

RE: Names for control characters (Was: "(in 6429)" in allkeys.txt)

2014-03-12 Thread Whistler, Ken
Please be very careful here. Having a non-empty value in field 1 of UnicodeData.txt is *not* the same as "having a Unicode name". See: http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207 for the gory details. The "Unicode name" is formally defined in terms of the Name property, which
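
The distinction shows up in libraries that expose the Name property. This is a small Python sketch with an illustrative helper name. It relies on unicodedata.name(), which returns the supplied default when a character has no name.

```python
import unicodedata


def name_or_none(ch):
    """Return the Name property value, or None when the character has none."""
    return unicodedata.name(ch, None)


print(name_or_none("A"))       # a graphic character with a name
print(name_or_none("\u0000"))  # a control code: no Name property value
print(name_or_none("\u0001"))
```

In UnicodeData.txt the control codes carry the label <control> in field 1. The lookup above reports no name for them, which matches the point that a non-empty field 1 is not the same thing as a Unicode name.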

RE: "(in 6429)" in allkeys.txt

2014-03-11 Thread Whistler, Ken
> I agree that a clarification in the text would be better than > a comment in allkeys.txt. But I also think just changing "(in 6429)" > to "(in ISO 6429)" would be enough. > > (Strange as it might seem for list regulars not everyone immediately > makes the right association from this four-digit

RE: "(in 6429)" in allkeys.txt

2014-03-11 Thread Whistler, Ken
Per asked: > In the DUCET file allkeys.txt, > http://www.unicode.org/Public/UCA/latest/allkeys.txt , > there is "(in 6429)" as a comment for some characters. > I first didn't understand why, but then I realized those are control > characters that are part of ISO/IEC 6429. > > Why is that pointed

RE: Transforming BidiTest.txt to the format of BidiCharacterTest.txt

2014-02-12 Thread Whistler, Ken
Eric, The C version of the bidiref code does that, in part. See the function br_ParseFileFormatB in brinput.c. http://www.unicode.org/Public/PROGRAMS/BidiReferenceC/6.3.0/ It doesn't actually *transform* the BidiTest.txt file to output the other format, but it parses the input and then constru

RE: Engmagate?

2013-12-13 Thread Whistler, Ken
Well, inconceivable? No. Inadvisable? yes. First of all, such “comments” are not actually “comments”—they are the result of a fairly cumbersome and drawn-out process of adding *normative* standardized variation sequences to the standard. Second – although this is a nit – FE0E and FE0F would not

RE: 6.3.0 Bidi implementation snag

2013-10-18 Thread Whistler, Ken
Loren, Your implementation is fine through [Resolving_Implicit_Levels]. And rule I1 *does* set the embedding level of the 0009 from 0 to 1. What you are missing is that rule L1 then *re*sets the level of 0009 back to the paragraph embedding level, i.e. 0. And that is how you get the expected res

RE: Code point vs. scalar value

2013-09-19 Thread Whistler, Ken
Stephan Stiller seems unconvinced by the various attempts to explain the situation. Perhaps an authoritative explanation of the textual history might assist. Stephan demands an answer: I want to know why the Glossary claims that surrogate code points are "[r]eserved for use by UTF-16". Reason

Case Table Compression Assumptions (was: RE: Posting Links to Ballots (was: RE: Why blackletter letters?))

2013-09-13 Thread Whistler, Ken
Steffen, FYI, Unicode 7.0, when it comes out, will have another entire bicameral (casing) script added to it: Warang Citi. And when Old Hungarian is finally published, at some point after Unicode 7.0, that will be *another* bicameral script added. It is unlikely that those two will be the last. An

RE: Origin of Ellipsis (was: RE: Empty set)

2013-09-13 Thread Whistler, Ken
I wrote: > > As Philippe surmised, it is a compatibility character, originally included > > in the Unicode 1.0 repertoire for cross-mapping to existing legacy > > encodings: > > > > Code Page 932: 0x81 0x64 > > Code Page 949: 0xA1 0xA6 > > Asmus responded: > which just pushes that question forwa

Origin of Ellipsis (was: RE: Empty set)

2013-09-13 Thread Whistler, Ken
Stephan Stiller noted: > Maybe ... and the origin of the single-glyph ellipsis remains a mystery > to me. As Philippe surmised, it is a compatibility character, originally included in the Unicode 1.0 repertoire for cross-mapping to existing legacy encodings: Code Page 932: 0x81 0x64 Code Page 94

V WITH HOOK (was: RE: IPA Greek)

2013-09-12 Thread Whistler, Ken
Julian, > > 028A is ʊ LATIN SMALL LETTER UPSILON > > 028B is ʋ LATIN SMALL LETTER V WITH HOOK > > > > These are used for different sounds. I'm not sure that either name is > particularly bizarre. > > I know what they *mean*. > The name "V WITH HOOK" is strange because there is no hook in ʋ, in >

Posting Links to Ballots (was: RE: Why blackletter letters?)

2013-09-11 Thread Whistler, Ken
David Starner asked: > Would it be possible to post links to the next ballots like these on > this list so that we can comment on them when they're live? It's a lot > harder to discuss them without actual links to the proposals or actual > ballots (more than just the names). Well, technically, no

RE: Why blackletter letters?

2013-09-10 Thread Whistler, Ken
Yucca asked: > As far as I can see, the document summarizes an agreement in an ad hoc > meeting. So it’s not late at all to raise objections, is it? It is way, way, waaay too late to raise objections for these two. Those characters are *published* in ISO/IEC 10646:2011 Amendment 1. They were in

RE: ASCII control codes in sequences of multibyte character sets

2013-08-30 Thread Whistler, Ken
Steffen, Sure. You encounter this problem for any multi-byte EBCDIC-based character encoding. In fact for any single-byte EBCDIC-based character encoding, as well. The EBCDIC control that corresponds to a line feed is either 0x15 or 0x25, depending on revisions. But you wouldn't ordinarily run int

RE: What to backup after corruption of code units?

2013-08-29 Thread Whistler, Ken
> The text in question is not exactly new to Unicode 6.2, probably goes > back to around the time > UTF-8 and UTF-16 were added over a decade ago. Getting a single question > on this passage after all these years would seem to indicate that > confusion isn't exactly rampant. Just to address the t

RE: Just an observation

2013-08-06 Thread Whistler, Ken
Steffen Daode Nurpmeso continued: > Hmm. To me, this raises the question why these constraints were > introduced at all. Imho either one adds constraints due to solid > considerations, and enforces them after some period of backward > compatibility, or there simply should be no constraints. Wha

RE: Just an observation

2013-08-05 Thread Whistler, Ken
Steffen Daode Nurpmeso observed: > Hello, in UAX #44 i read > > Simple_Titlecase_Mapping ... > Note: If this field is null, then the Simple_Titlecase_Mapping > is the same as the Simple_Uppercase_Mapping for this character. > > So a parser has to be aware of this, automatically falling
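
The fallback described in the quoted UAX #44 note amounts to one extra default in a parser. This is a minimal Python sketch, assuming the semicolon-separated UnicodeData.txt layout with the simple uppercase, lowercase, and titlecase mappings in fields 12, 13, and 14. The function name is illustrative, and the first sample line is hypothetical, with its titlecase field emptied to exercise the fallback.

```python
def simple_case_mappings(line):
    """Parse one UnicodeData.txt line into (code point, upper, lower, title)."""
    fields = line.split(";")
    cp = int(fields[0], 16)

    def parse(field, default):
        # An empty field means no explicit mapping is recorded.
        return int(field, 16) if field else default

    upper = parse(fields[12], cp)
    lower = parse(fields[13], cp)
    # Per the quoted note: a null titlecase field falls back to the uppercase mapping.
    title = parse(fields[14], upper)
    return cp, upper, lower, title


# Hypothetical line: the titlecase field is deliberately left empty.
hypothetical = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;"
# A line in the same layout where titlecase differs from uppercase.
dz = ("01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;;;;N;"
      "LATIN SMALL LETTER D Z HACEK;;01C4;;01C5")

for sample in (hypothetical, dz):
    print([f"{value:04X}" for value in simple_case_mappings(sample)])
```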

RE: _Unicode_code_page_and_?.net

2013-08-05 Thread Whistler, Ken
> > On 7/30/2013 3:27 PM, Asmus Freytag wrote: > > > architectures that depended on swapping character sets (code > > > pages) in mid stream > > > > I thought systems were usually married to a particular code page. I'm > > wondering where (historically) you'd actually change to a different > > co

RE: polytonic Greek: diacritics above long vowels ᾱ, ῑ, ῡ

2013-08-05 Thread Whistler, Ken
Poring back over this voluminous thread to Stephan Stiller's original question: > If one wants to indicate vowel length for the length-ambiguous vowels α, > ι, υ in Ancient Greek, one writes ᾱ, ῑ, ῡ. Is there a reason for why > there are no diacritic-precomposed characters? I guess it's because >

RE: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Whistler, Ken
> Suppose that these hex bytes: > > C3 83 C2 B1 > > show up in a message and the message contains no hint what its encoding is. > > Perhaps it is 8859-1, in which case the message consists of four 1-byte > characters: > > C3 = Ã > 83 = the “no break here” character > C2 = Â > B1 = ± >
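
The four bytes in the question can be tried under each candidate encoding. This is a minimal Python sketch. The last step assumes one specific and common failure, namely text that was encoded as UTF-8, misread as Latin-1, and then encoded as UTF-8 again. That is a guess about the data's history, and nothing in the bytes themselves proves it.

```python
raw = bytes([0xC3, 0x83, 0xC2, 0xB1])


def code_points(text):
    """Format a string as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Reading 1: ISO 8859-1, one character per byte.
as_latin1 = raw.decode("latin-1")
# Reading 2: UTF-8, two characters of two bytes each.
as_utf8 = raw.decode("utf-8")
# Reading 3: undo one layer of the assumed double encoding.
repaired = as_utf8.encode("latin-1").decode("utf-8")

print(code_points(as_latin1))
print(code_points(as_utf8))
print(code_points(repaired))
```

All three readings decode without error, which is why a decoder cannot choose between them without outside information about the message.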

Scalability of ScriptExtensions (was: RE: Borrowed Thai Punctuation in Tai Tham Text)

2013-07-08 Thread Whistler, Ken
Richard Wordingham asked: > How many examples do I need to collect to add Tai Tham to the script > extensions property for ... ? IMO, a couple well-documented examples ought to suffice. But, this query raises a couple further questions for me regarding the scalability and maintenance of ScriptEx

RE: Suggestion for new dingbats/symbols

2013-05-30 Thread Whistler, Ken
> How to write a mail like this: > "When you arrive at Madrid airport, follow the sign that looks like this: [?]" > Even if the font library supports all needed symbols, it will be easier to > send a photo than to choose the sign from a huge Unicode symbols list. Yep. This discussion about signs

Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-19 Thread Whistler, Ken
However, now that I've got your hopes up on procedural grounds... Getting on to the particulars: > I do have two particular reasons for asking. > > 2. My research. > > There is a document entitled locse027_four_simulations.pdf available from > the following forum post. > > http://forum.high-lo

RE: UTC Document Register Now Public

2013-04-19 Thread Whistler, Ken
William J.G. Overington asked: > Suppose that a member of the public sends a document that seeks discussion > by the Unicode Technical Committee about whether the scope of what > Unicode encodes should be extended in some particular regard, with the > member of the public writing about why he or s

RE: UTC Document Register Now Public

2013-04-16 Thread Whistler, Ken
Karl Pentzlin asked: > >> The Unicode Technical Committee (UTC) document register is now freely > >> available for public access. > > Thank you. > Are the URLs guaranteed to be stable? The short answer is yes. --Ken

RE: Processing Digit Variants

2013-03-18 Thread Whistler, Ken
Richard Wordingham wrote: > European digits (U+0030 to U+0039) may, since Unicode 6.1.0, be used > with variation selectors. As their primary purpose is for use with > U+20E3 COMBINING ENCLOSING KEYCAP, is it legitimate to fail to > recognise strings of digits with variation selectors as represen

RE: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Whistler, Ken
Richard Wordingham wrote: > Actually, there is a subtle and nasty difference, but probably one that > will very rarely strike practical use. Its most obvious manifestation > is in the application of the UCA parametric tailoring > topVariable="u2FD5". U+2FD5 KANGXI RADICAL FLUTE is the last symb

RE: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Whistler, Ken
Richard Wordingham wrote: > > It loosened up the spec, so that the spec itself didn't seem to be > > requiring that each of the first 3 levels had to be expressed with a > > full 16 bits in any collation element table. > > I don't read it that way. But it did allow the 4th weight to go up to > 1

RE: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Whistler, Ken
Richard Wordingham wrote: > One of the changes from Version 6.1.0 to 6.2.0 of the UCA (UTS#10) > was to change weights from being 16 bits to just being general > non-negative integers. Was this just to accommodate the 4th weight in > DUCET (scheduled for deletion in Version 6.3.0), or is it

RE: Does HYPHEN BULLET have synonyms?

2013-02-22 Thread Whistler, Ken
Jukka said: > The comments at the start of NamesList.txt say that it is > “semi-automatically derived from UnicodeData.txt”, but the information > you are referring to has actually been picked up from the code charts. > They contain both informative alias names and cross references. The "semi-aut

RE: pIqaD in actual use

2013-02-20 Thread Whistler, Ken
> ... the first language course written in pIqaD > and approved by CBS and Marc Okrand. It was translated by Jonathan > Brown and Okrand and uses the Hol-pIqaD TrueType font. > > That should help at least some of the pIqaD in real use problems, > though not the OMG! Klingon problems. Indeed. Alth

RE: New Canonical Decompositions to Non-Starters

2013-02-18 Thread Whistler, Ken
Well, it isn't prohibited, so I guess you will need to be forever vigilant in view of the possibility that somebody might get it in their head to encode some combining mark that isn't already accounted for in Tibetan *and* that they would simultaneously insist that a precomposed form of that mar

RE: FCD and Collation

2013-02-11 Thread Whistler, Ken
> Does anyone feel up to rigorously justifying revisions to the concepts > and algorithms of FCD and canonical closure? Occasionally one will > encounter cases where the canonical closure is infinite - in these > cases, normalisation will be necessary regardless of the outcome of the > FCD check.

FW: Why are the low surrogates numerically larger than the high surrogates?

2013-01-23 Thread Whistler, Ken
-Original Message- From: ken.whist...@sap.com Sent: Wednesday, January 23, 2013 10:48 AM To: 'Costello, Roger L.' Subject: RE: Why are the low surrogates numerically larger than the high surrogates? > Why are the low surrogates numerically larger than the high surrogates? > > That i
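
The arithmetic behind the two surrogate ranges can be sketched in a few lines of Python. The function name is illustrative and the code is not taken from the message. It follows the standard UTF-16 encoding of a supplementary code point, in which the first code unit carries the high-order ten bits of the offset and the second carries the low-order ten bits.

```python
def to_surrogates(cp):
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top ten bits, in D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom ten bits, in DC00..DFFF
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")
# Cross-check against the interpreter's own UTF-16 encoder.
print(chr(0x1F600).encode("utf-16-be").hex())
```

On this reading, "high" and "low" describe which bits of the offset a code unit carries, not where its range falls numerically.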

RE: RLI and "bdi", and how to get an update of changes

2013-01-15 Thread Whistler, Ken
> >> what does that different from the RLI > >> U+2067/PDF U+2068? if it is the same, can we use U+2066 in HTML > replacing > >> ""? > > Code points 2066, 2067, and 2068 are unassigned. I presume you mean > U+202B RIGHT-TO-LEFT EMBEDDING (RLE) and U+202C POP DIRECTIONAL > FORMATTING. No, actua

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-08 Thread Whistler, Ken
> Sorry, but I have to disagree here. If a list of strings contains items > with lone surrogates (garbage), then sorting them doesn't make the > garbage go away, even if the items may be sorted in "correct" order > according to some criterion. Well, yeah, I wasn't claiming that the principled, "co

RE: Q is a Roman numeral?

2013-01-07 Thread Whistler, Ken
I'm gonna take a wild stab here and assume that this is "Q" as the medieval Latin abbreviation for "quingenti", which usually means 500, but also gets glossed just as a big number, as in "milia quingenta" "thousands upon thousands". Maybe some medieval scribe substituted a Q for |V| (with an ov

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Whistler, Ken
> > http://www.unicode.org/reports/tr10/#Handline_Illformed Grrr. http://www.unicode.org/reports/tr10/#Handling_Illformed I seem unable to handle ill-formed spelling today. :( --Ken

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Whistler, Ken
Martin, The kind of situation Markus is talking about is illustrated particularly well in collation. And there is a section 7.1.1 in UTS #10 specifically devoted to this issue: http://www.unicode.org/reports/tr10/#Handline_Illformed When weighting Unicode 16-bit strings for collation, you can

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Whistler, Ken
Philippe also said: > ... Reserving "UTF-16" for what the standard discusses as a > "16-bit string", except that it should still require UTF-16 > conformance (no unpaired surrogates and no non-characters) ... For those following along, conformance to UTF-16 does *NOT* require "no non-characters"
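
One encoder's behaviour illustrates the distinction. This is a minimal Python sketch with an illustrative helper name. It shows what CPython's strict UTF-16 codec accepts, which is consistent with the statement above but is evidence about one implementation, not a conformance proof.

```python
def encodes_as_utf16(text):
    """Report whether the strict UTF-16 codec accepts the string."""
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


print(encodes_as_utf16("\uFFFE"))  # noncharacter: accepted
print(encodes_as_utf16("\uFDD0"))  # noncharacter: accepted
print(encodes_as_utf16("\uD800"))  # unpaired surrogate: rejected
```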

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Whistler, Ken
Philippe Verdy said: > Well then I don't know why you need a definition of an "Unicode 16-bit > string". For me it just means exactly the same as "16-bit string", and > the encoding in it is not relevant given you can put anything in it > without even needing to be conformant to Unicode. So a Java

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Whistler, Ken
One of the reasons why the Unicode Standard avoids the term “valid string” is that it immediately begs the question, valid *for what*? The Unicode string is just a sequence of 3 Unicode characters. It is valid *for* use in internal processing, because for my own processing I can decide I need t

RE: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Whistler, Ken
Yannis' use of the terminology "not ... a valid string in Unicode" is a little confusing there. A Unicode string with the sequence, say, (a combining grave mark, followed by "a"), is "valid" Unicode in the sense that it just consists of two Unicode characters in a sequence. It is aberrant, ce

RE: holes (unassigned code points) in the code charts

2013-01-04 Thread Whistler, Ken
Whoops! http://www.unicode.org/alloc/CurrentAllocation.html --Ken > The editors maintain some statistical information relevant to this fun > question > at: > > http://www.unicode.org/alloc/CurrentAllocaiton.html

RE: holes (unassigned code points) in the code charts

2013-01-04 Thread Whistler, Ken
Stephan Stiller continued: > Occasionally the question is asked how many characters Unicode has. This > question has an answer in section D.1 of the Unicode Standard. I > suspect, however, that once in a while the motivation for asking this > question is to find out how much of Unicode has been "u

RE: Jamo_Short_Name

2013-01-02 Thread Whistler, Ken
André Schappo asked: > Been looking at http://www.unicode.org/Public/UNIDATA/Jamo.txt > > There appears to be 2 different romanizations at play in the file? One for the > short name and another for the full name > eg 1100; G # HANGUL CHOSEONG KIYEOK > > I have searched unicode.org but cannot f

RE: locale-aware string comparisons

2012-12-31 Thread Whistler, Ken
Well, in answering the question which was actually posed here: 1. ISO/IEC 10646 has absolutely nothing to say about this issue, because 10646 does not define case mapping at all. 2. The Unicode Standard *does* define case mapping, of course, as well as case folding. The relevant details are in

RE: UCA and Russian letter Ё

2012-12-26 Thread Whistler, Ken
Leo asked: > My question was narrower: assuming that the strings being compared are > words, could it be supported without any markup? ... where "it" refers to conditional weighting based on the (identified) word boundary. And the answer to that is no, unless the word boundary was explicitly in

RE: UCA and Russian letter Ё

2012-12-26 Thread Whistler, Ken
The UCA algorithm itself has no "opinion" on this issue. It is simply a specification of *how* to compare strings at multiple levels, given a multi-level collation weight table. The UCA *does* have a default behavior, of course, based on the DUCET table. And the DUCET table puts all Unicode cha

RE: UCA and Russian letter Ё

2012-12-21 Thread Whistler, Ken
Leo Broukhis said: > Granted, not yet, but by itself the argument is invalid. Unicode > collation rules are descriptive; I'm not sure what you mean by that. UTS #10 is a *specification* of an algorithm, with various options for tailoring and parameterization which make it possible to accommoda

RE: Character name translations

2012-12-20 Thread Whistler, Ken
Jukka Korpela noted: > The standard ISO 10646, which is equivalent to Unicode as regards to > character names, is published in French, too Actually ISO/IEC 10646 is *not* published in French, too. But a related standard, the international string ordering standard, ISO/IEC 14651 (the one whose

RE: Question about normalization tests

2012-12-10 Thread Whistler, Ken
Your misunderstanding is at the highlighted statement below. Actually 0300 *is* blocked from 0061 in this sequence, because it is preceded by a character with the same canonical combining class (i.e. U+0305, ccc=230). A blocking context is the preceding combining character either having ccc=0 or

What is happening with hieroglyphs (was: RE: Why 17 planes?)

2012-11-28 Thread Whistler, Ken
Philippe is (apparently) referring to higher-level protocols for markup of hieroglyphic text. See, e.g., Table 14-10 and Figure 14-2, p. 489 in Section 14.18, Egyptian Hieroglyphs in TUS 6.2: http://www.unicode.org/versions/Unicode6.2.0/ch14.pdf Similar kinds of higher-level protocols are envis

RE: Why 17 planes? (was: Re: Why 11 planes?)

2012-11-27 Thread Whistler, Ken
There isn't an actual problem here which needs a solution, satisfactory, or otherwise. The persistence of the "17 planes may not be enough" meme on this list is an interesting phenomenon in itself, but has no practical impact on any of the actual ongoing work on maintenance of the encoding stand

RE: StandardizedVariants.txt error?

2012-11-26 Thread Whistler, Ken
Actually, I think the omission here is the word "canonical". In other words, Section 16.4 should probably read: "The base character in a variation sequence is never a combining character or a *canonical* decomposable character." Note that with this addition, StandardizedVariants.txt poses no co

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
Yep. --Ken Latin1 explicitly gives no semantics to several byte values (for example 0x81), but acknowledges that other standards will define their semantics. Unicode provides code-points with equally-undefined semantics so that these bytes can pass through without change. This allows a byte-leve

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
No, Unicode doesn't. But yes, it *does* follow that decoding C0/C1 control codes produces a Unicode code point of equal value. RTFM. TUS 6.2, p. 544: "There are 65 code points set aside in the Unicode Standard for compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022 framewor
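
The equal-value behaviour is easy to check against a real decoder. This is a minimal Python sketch, assuming the interpreter's "latin-1" codec, which implements the full 256-entry mapping including the C0 and C1 ranges. It decodes every byte value and looks for any byte whose code point differs from its own value.

```python
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Collect any byte whose decoded code point is not numerically equal to it.
mismatches = [b for b, ch in zip(all_bytes, decoded) if ord(ch) != b]
print(mismatches)

# The byte mentioned in the thread, 0x81, becomes the control code U+0081.
print(f"U+{ord(bytes([0x81]).decode('latin-1')):04X}")

# The mapping is reversible, so bytes pass through a round trip unchanged.
print(decoded.encode("latin-1") == all_bytes)
```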

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
A IANA-registered character *map* is a very different animal from a character encoding standard per se. The actual character encoding standard, ISO/IEC 8859-1:1998 does not define the C0 and C1 control codes (and never will). That was what I was quoting from. A mapping table, on the other hand,

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
Actually, what Buck really needs is Section 16.1 Control Codes: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf That explains the situation for the *non* graphic characters in the range U+0000..U+00FF, which is the source of the concern for Buck's skeptical workmates, I'm sure. --Ken >

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
The first 256 characters of the Unicode Standard *are* compatible with ISO/IEC 8859-1 (Latin-1), but you need to distinguish what happens for the graphic characters from what happens for the control codes. ISO 8859-1 defines *graphic* characters in the ranges 0x20..0x7E, 0xA0..0xFF. Those are e

RE: VS: Mayan numerals

2012-09-26 Thread Whistler, Ken
Marion Gunn wrote: > -Original Message- > From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] > On Behalf Of Marion Gunn > Sent: Wednesday, September 26, 2012 10:53 AM > To: 'Unicode List' > Subject: Re: VS: Mayan numerals ... > > This simple request to encode Mayan numer