On Thu, Jun 23, 2016 at 12:56 PM, Xiaodi Wu <xiaodi...@gmail.com> wrote:
> On Thu, Jun 23, 2016 at 12:41 PM, João Pinheiro <j...@joaopinheiro.org>
> wrote:
>
>> There are two different issues here, individual character normalisation
>> and identifier canonicalisation. NFC handles character normalisation and it
>> definitely should be part of the proposal since identifier canonicalisation
>> doesn't make sense if the individual character representation isn't
>> normalised first.
>>
>
> I think we're using terminology differently here. What you call "character
> normalization" is what I'm calling canonicalization. NFC is described in
> UAX #15 as "canonical decomposition followed by canonical composition" and
> I'm just using the word "canonicalization" because it's shorter. If Swift
> represents each identifier in an NFC-transformed form (what I call
> canonicalized), then I understand the identifier to be canonicalized. What
> is the distinction you're drawing here?
>
>
>>
>> Swift currently doesn't normalise unicode characters, as can be seen in
>> the following code example:
>>
>> let Å = "Hello" // Angstrom
>> let Å = "Swift" // Latin Capital Letter A With Ring Above
>> let Å = "World" // Latin Capital Letter A + Combining Ring Above
>>
>> print(Å)
>> print(Å)
>> print(Å)
>>
>> According to the unicode standard, all 3 of these characters should be
>> normalised into the same representation.
>>

Just re-read UAX #31. I see two different issues here too--do these match up
with what you're saying above?

* Disallowing certain glyphs in identifiers. To do so, we can implement the
  recommendation to disallow all glyphs in UAX #31 Table 4, except ZWJ and
  ZWNJ in the specific scenarios outlined in section 2.3.

* Internally, when comparing two identifiers A and B, compare NFC(A) and
  NFC(B) without modifying or otherwise restricting the actual user-facing
  code to contain only NFC-normalized strings. This would be the approach
  recommended in section 1.3.
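
The second point can be sketched roughly as follows. This is only an
illustration, not how the compiler does or would do it: it assumes
Foundation's precomposedStringWithCanonicalMapping as the NFC transform, and
nfcScalars and sameIdentifier are hypothetical helper names made up for the
example.

```swift
import Foundation

// Hypothetical helper: the NFC form of a string as a list of Unicode scalars.
// Assumes Foundation's precomposedStringWithCanonicalMapping yields NFC.
func nfcScalars(_ s: String) -> [Unicode.Scalar] {
    return Array(s.precomposedStringWithCanonicalMapping.unicodeScalars)
}

// Hypothetical helper: two identifiers match when their NFC forms are
// identical. The spellings in the source file are left untouched.
func sameIdentifier(_ a: String, _ b: String) -> Bool {
    return nfcScalars(a) == nfcScalars(b)
}

let angstrom = "\u{212B}"     // ANGSTROM SIGN
let precomposed = "\u{00C5}"  // LATIN CAPITAL LETTER A WITH RING ABOVE
let decomposed = "A\u{030A}"  // A + COMBINING RING ABOVE

// The raw scalar sequences differ...
print(Array(angstrom.unicodeScalars) == Array(precomposed.unicodeScalars)) // false
// ...but comparison under NFC treats all three spellings as one identifier.
print(sameIdentifier(angstrom, precomposed))    // true
print(sameIdentifier(precomposed, decomposed))  // true
```

The sketch compares scalar arrays rather than String values on purpose:
Swift's String == already applies canonical equivalence, so it would hide the
difference that the identifier lookup currently sees.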
>> Sincerely,
>> João Pinheiro
>>
>>
>> On 23 Jun 2016, at 17:40, Xiaodi Wu <xiaodi...@gmail.com> wrote:
>>
>> I think this issue is bigger than that. As UAX #31 suggests, the most
>> appropriate approach is canonicalizing identifiers by NFC, with specific
>> treatment of ZWJ and ZWNJ by allowing them in three contexts, which will
>> require thought as to how to implement.
>>
>> Given that there is a specifically recommended algorithm on how to handle
>> this issue, I'm also not sure anymore that this requires a proposal;
>> "process Unicode correctly" is really more of a bug fix because, given the
>> strict limits of what's canonicalized, there shouldn't be a user-facing
>> effect if we are merely proposing to prohibit glyphs from appearing in
>> certain contexts where they are never in fact encountered in real language.
>>
>> On Thu, Jun 23, 2016 at 11:19 AM Sean Heber <s...@fifthace.com> wrote:
>>
>>> I’m no unicode expert, but this sounds like the way to go to me.
>>>
>>> l8r
>>> Sean
>>>
>>>
>>> > On Jun 23, 2016, at 11:17 AM, João Pinheiro via swift-evolution
>>> > <swift-evolution@swift.org> wrote:
>>> >
>>> >
>>> >> On 21 Jun 2016, at 20:15, Xiaodi Wu via swift-evolution
>>> >> <swift-evolution@swift.org> wrote:
>>> >>
>>> >> On Tue, Jun 21, 2016 at 1:16 PM, Joe Groff <jgr...@apple.com> wrote:
>>> >> Any discussion about this ought to start from UAX #31, the Unicode
>>> >> consortium's recommendations on identifiers in programming languages:
>>> >>
>>> >> http://unicode.org/reports/tr31/
>>> >>
>>> >> Section 2.3 specifically calls out the situations in which ZWJ and
>>> >> ZWNJ need to be allowed. The document also describes a stability
>>> >> policy for handling new Unicode versions, other confusability issues,
>>> >> and many of the other problems with adopting Unicode in a programming
>>> >> language's syntax.
>>> >>
>>> >> That's a fantastic document--a very edifying read. Given Swift's
>>> >> robust support for Unicode in its core libraries, it's kind of
>>> >> surprising to me that identifiers aren't canonicalized at compile
>>> >> time. From a quick first read, faithful adoption of UAX #31
>>> >> recommendations would address most if not all of the confusability
>>> >> and zero-width security issues raised in this conversation.
>>> >
>>> > From what I've read of UAX #31 it does seem to address all of the
>>> > invisible character issues raised in the discussion. Given their unicode
>>> > status of Default_Ignorable_Code_Points, I believe the best course of
>>> > action would be to canonicalise identifiers by allowing invisible
>>> > characters only where appropriate and ignoring them everywhere else.
>>> >
>>> > The alternative to ignoring them would be to not canonicalise
>>> > identifiers and treat invisible characters as an error instead.
>>> >
>>> > This doesn't address the issue of unicode confusable characters, but
>>> > solving that has additional problems of its own and would probably be
>>> > better addressed in a different proposal entirely.
>>> >
>>> > I'd like to start writing the proposal if there is agreement that this
>>> > would be the best course of action.
>>> >
>>> > Sincerely,
>>> > João Pinheiro
>>> > _______________________________________________
>>> > swift-evolution mailing list
>>> > swift-evolution@swift.org
>>> > https://lists.swift.org/mailman/listinfo/swift-evolution
>>>
>>>
>>
>