RE: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Shawn Steele via Unicode
IMO, encodings, particularly stateful ones such as this, may have multiple 
ways to output the same, or similar, sequences.  Which means that pretty much 
any time an encoding transforms data, any previous security or other 
validation-style checks are no longer valid, and any security/validation must 
be performed again.  I've seen numerous mistakes due to people expecting 
encodings to play nicely, particularly if there are different endpoints that 
may use different implementations with slightly different behaviors.

-Shawn

-Original Message-
From: Unicode  On Behalf Of Henri Sivonen via 
Unicode
Sent: Sunday, August 16, 2020 11:39 PM
To: Mark Davis ☕️ 
Cc: Unicode Public 
Subject: Re: Generating U+FFFD when there's no content between ISO-2022-JP 
escape sequences

Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️  wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no 
> content between escapes in "shifting" encodings like ISO-2022-JP is 
> unnecessary, and for consistency between implementations should not be 
> recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the 
> committee can look at your proposal with an eye to changing 
> http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode 
>  wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there 
>> is no content between two ISO-2022-JP escape sequences from the 
>> WHATWG Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in 
>> that case is not a useful security measure when unnecessary 
>> transitions between the ASCII and Roman states do not generate U+FFFD?
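A rough sketch of that point, assuming the standard ESC ( B (ASCII) and
ESC ( J (JIS X 0201 Roman) escape sequences; lowercase letters encode
identically in both states, so the bytes below still decode as "delete" and no
segment between escapes is empty:

    ESC_ASCII = b"\x1b(B"   # switch to ASCII
    ESC_ROMAN = b"\x1b(J"   # switch to JIS X 0201 Roman

    word = b"delete"
    obfuscated = b"".join(
        (ESC_ROMAN if i % 2 else ESC_ASCII) + bytes([c])
        for i, c in enumerate(word)
    )
    # Every escape sequence is followed by exactly one character, so a decoder
    # that only flags *empty* segments emits no U+FFFD, yet a naive byte-level
    # filter no longer sees the literal bytes b"delete" in the stream.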
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal 
>> > input byte sequence. Instead, it must stop with an error or 
>> > substitute a replacement character (such as U+FFFD REPLACEMENT
>> > CHARACTER) or an escape sequence in the output. (See
>> > also Section 3.5 Deletion of Code Points.) It is important to do 
>> > this not only for byte sequences that encode characters, but also for 
>> > unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next 
>> > shift sequence. The formal syntaxes for HZ and most CJK ISO-2022 
>> > variants require at least one character in a text segment between 
>> > shift sequences. Security software written to the formal 
>> > specification may not detect malicious text  (for example, "delete" 
>> > with a shift-to-double-byte then an immediate shift-to-ASCII in the 
>> > middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of 
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into 
>> > its ISO-2022-JP decoder algorithm 
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements 
>> > the WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that 
>> > didn't implement this U+FFFD generation behavior (uconv), a bug has 
>> > been logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP 
>> > the unusual and surprising property that concatenating two 
>> > ISO-2022-JP outputs from a conforming encoder can result in a byte 
>> > sequence that is non-conforming as input to an ISO-2022-JP decoder.
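A hedged sketch of that property (the escapes are the standard ISO-2022-JP
switches; the two-byte payloads are arbitrary valid JIS X 0208 codes):

    ESC_JIS0208 = b"\x1b$B"   # switch to JIS X 0208
    ESC_ASCII   = b"\x1b(B"   # switch back to ASCII

    part_a = ESC_JIS0208 + b"\x24\x33" + ESC_ASCII   # conforming output, ends in ASCII state
    part_b = ESC_JIS0208 + b"\x24\x4b" + ESC_ASCII   # another conforming output

    stream = part_a + part_b
    # At the join the bytes are ... 1B 28 42 1B 24 42 ...: one escape sequence
    # immediately followed by another with no character in between, so a decoder
    # that follows the TR36/Encoding Standard rule emits U+FFFD there, even
    # though each half decodes cleanly on its own.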
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP 
>> > escape sequence is immediately followed by another ISO-2022-JP 
>> > escape sequence. Chrome and Safari do, but their implementations of 
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's 
>> > decoder implementations generally are informed by the Encoding 
>> > Standard (though the ISO-2022-JP decoder specifically might not be 
>> > yet), and I suspect that Safari's implementation (ICU) is either 
>> > informed by Unicode Security Considerations or vice versa.
>> >
>> > The example given as rationale in Unicode Security Considerations, 
>> > obfuscating the ASCII string "delete", could be accomplished by 
>> > alternating between the ASCII and Roman states so that every other
>> > character is in the ASCII state and the rest of the Roman 

RE: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Shawn Steele via Unicode
I'm not opposed to a sub-block for "Modern Hieroglyphs".

I confess that even though I know nothing about Hieroglyphs, I find it 
fascinating that such a thoroughly dead script might still be living in some 
way, even if only a little bit.

-Shawn

-Original Message-
From: Unicode  On Behalf Of Ken Whistler via 
Unicode
Sent: Thursday, February 13, 2020 12:08 PM
To: Phake Nick 
Cc: unicode@unicode.org
Subject: Re: Egyptian Hieroglyph Man with a Laptop

You want "dubious"?!

You should see the hundreds of strange characters already encoded in the CJK 
*Unified* Ideographs blocks, as recently documented in great detail by Ken 
Lunde:

https://www.unicode.org/L2/L2020/20059-unihan-kstrange-update.pdf

Compared to many of those, a hieroglyph of a man (or woman) holding a laptop is 
positively orthodox!

--Ken

On 2/13/2020 11:47 AM, Phake Nick via Unicode wrote:
> Those characters could also be put into another block for the same 
> script similar to how dubious characters in CJK are included by 
> placing them into "CJK Compatibility Ideographs" for round trip 
> compatibility with source encoding.



RE: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Shawn Steele via Unicode
> From the point of view of Unicode, it is simpler: If the character is in use 
> or has had use, it should be included somehow.

That bar, to me, seems too low.  Many things are only used briefly or in a 
private context that doesn't really require encoding.

The hieroglyphs discussion is interesting because it presents them as living 
(in at least some sense) even though they're a historical script.  Apparently 
modern Egyptologists are co-opting them for their own needs.  There are lots of 
emoji for professional fields.  In this case, since hieroglyphs are pictorial, 
it seems they've blurred the lines between the script and emoji.  Given their 
field, I'd probably do the same thing.

I'm not opposed to the character if Egyptologists use it amongst themselves, 
though it does make me wonder whether it belongs in this set.  Are there other 
"modern" hieroglyphs?  (Other than the errors, etc. mentioned earlier; rather, 
glyphs that have been invented for modern use.)

-Shawn 




RE: Unicode "no-op" Character?

2019-07-03 Thread Shawn Steele via Unicode
I think you're overstating my concern :)

I meant that those things tend to be particular to a certain context and often 
aren't interesting for interchange.  A text editor might find it convenient to 
place word boundaries in the middle of something another part of the system 
thinks is a single unit to be rendered.  At the same time, a rendering engine 
might find it interesting that there's an "ff" together and want to mark it to 
be shown as a ligature, though that text editor wouldn't be keen on that at all.

As has been said, these are private mechanisms for things that individual 
processes find interesting.  It's not useful to mark those for interchange, as 
the text editor's word-breaking marks would interfere with the graphics 
engine's glyph-breaking marks.  Not to mention the transmission buffer size 
marks originally mentioned, which could be anywhere.

The "right" thing to do here is to use an internal higher level mechanism to 
keep track of these things however the component needs.  That can even be 
interchanged with another component designed to the same principles, via 
mechanisms like the PUA.  However, those components can't expect their private 
mechanisms are useful or harmless to other processes.  

Even more complicated is that, as pointed out by others, it's pretty much 
impossible to say "these n codepoints should be ignored and have no meaning" 
because some process would try to use codepoints 1-3 for some private meaning.  
Another would use codepoint 1 for their own thing, and there'd be a conflict.  

As a thought experiment, I think it's certainly decent to ask the question 
"could such a mechanism be useful?"  It's an intriguing thought and a decent 
hypothesis that this kind of system could be privately useful to an 
application.  I also think that the conversation has pretty much proven that 
such a system is mathematically impossible.  (You can't have a "private" 
no-meaning codepoint that won't conflict with other "private" uses in a public 
space).

It might be worth noting that this kind of thing used to be fairly common in 
early computing.  Word processors would inject a "CTRL-I" token to toggle 
italics on or off.  Old printers used to use sequences to mark the start of 
bold or italic or underlined or whatever text.  Those were private and 
pseudo-private mechanisms that were used internally &/or documented for others 
that wanted to interoperate with their systems.  (The printer folks would tell 
the word processors how to make italics happen, then other printer folks would 
use the same or similar mechanisms for compatibility - except for the dude that 
didn't get the memo and made their own scheme.)

Unicode was explicitly intended *not* to encode any of that kind of markup, 
and, instead, be "plain text," leaving other interesting metadata to other 
higher level protocols.  Whether those be word breaking, sentence parsing, 
formatting, buffer sizing or whatever.

-Shawn

-Original Message-
From: Unicode  On Behalf Of Richard Wordingham via 
Unicode
Sent: Wednesday, July 3, 2019 4:20 PM
To: unicode@unicode.org
Subject: Re: Unicode "no-op" Character?

On Wed, 3 Jul 2019 17:51:29 -0400
"Mark E. Shoulson via Unicode"  wrote:

> I think the idea being considered at the outset was not so complex as 
> these (and indeed, the point of the character was to avoid making 
> these kinds of decisions).

Shawn Steele appeared to be claiming that there was no good, interesting reason 
for separating base character and combining mark.  I was refuting that notion.  
Natural text boundaries can get very messy - some languages have word 
boundaries that can be *within* an indecomposable combining mark.

Richard.



RE: Unicode "no-op" Character?

2019-06-23 Thread Shawn Steele via Unicode
But... it's not actually discardable.  The hypothetical "packet" architecture 
(using the term architecture somewhat loosely) needed the information being 
tunneled in by this character.  If it was actually discardable, then the "noop" 
character wouldn't be required as it would be discarded.

Since the character conveys meaning to some parts of the system, then it's not 
actually a "noop" and it's not actually "discardable".  

What is actually being requested isn't a character that has no meaning for 
anybody, but rather a character that has no PUBLIC meaning.  

Which leads us to the key.  The desire is for a character that has no public 
meaning, but has some sort of private meaning.  In other words it has a private 
use.  Oddly enough, there is a group of characters intended for private use, in 
the PUA ;-)

Of course if the PUA characters interfered with the processing of the string, 
they'd need to be stripped, but you're sort of already in that position by 
having a private flag in the middle of a string.
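A minimal sketch of that idea (hypothetical helper names; U+E000 is just an
arbitrarily chosen PUA code point standing in for a private flag):

    MARK = "\uE000"   # arbitrary Private Use Area code point used as an internal flag

    def insert_mark(text: str, index: int) -> str:
        """Flag a position for this component's internal use only."""
        return text[:index] + MARK + text[index:]

    def strip_marks(text: str) -> str:
        """Remove the private flags before handing the string to anything else."""
        return text.replace(MARK, "")

    flagged = insert_mark("Hello world", 5)
    assert strip_marks(flagged) == "Hello world"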

-Shawn  

-Original Message-
From: Unicode  On Behalf Of Slawomir Osipiuk via 
Unicode
Sent: Saturday, June 22, 2019 6:10 PM
To: unicode@unicode.org
Cc: 'Richard Wordingham' 
Subject: RE: Unicode "no-op" Character?

That's the key to the no-op idea. The no-op character could not ever be assumed 
to survive interchange with another process. It'd be canonically equivalent to 
the absence of a character. It could be added or removed at any position by a 
Unicode-conformant process. A program could wipe all the no-ops from a string 
it has received, and insert its own for its own purposes. (In fact, it should 
wipe the old ones so as not to confuse
itself.) It's "another process's discardable junk" unless known, 
internally-only, to be meaningful at a particular stage.

While all the various (non)joiners/ignorables are interesting, none of them 
have this property.

In fact, that might be the best description: It's not just an "ignorable", it's 
a "discardable". Unicode doesn't have that, does it?

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard 
Wordingham via Unicode
Sent: Saturday, June 22, 2019 20:59
To: unicode@unicode.org
Cc: Shawn Steele
Subject: Re: Unicode "no-op" Character?

If they're conveying an invisible message, one would have to strip out original 
ZWNBSP/WJ/ZWSP that didn't affect line-breaking.  The weak point is that that 
assumes that line-break opportunities are well-defined.  For example, they 
aren't for SE Asian text.

Richard.




RE: Unicode "no-op" Character?

2019-06-22 Thread Shawn Steele via Unicode
Assuming you were using any of those characters as "markup", how would you know 
when they were intentionally in the string and not part of your marking system?

-Original Message-
From: Unicode  On Behalf Of Richard Wordingham via 
Unicode
Sent: Saturday, June 22, 2019 4:17 PM
To: unicode@unicode.org
Subject: Re: Unicode "no-op" Character?

On Sat, 22 Jun 2019 17:50:49 -0400
Sławomir Osipiuk via Unicode  wrote:

> If faced with the same problem today, I’d probably just go with U+FEFF 
> (really only need a single char, not a whole delimited substring) or a 
> different C0 control (maybe SI/LS0) and clean up the string if it 
> needs to be presented to the user.

You'd really want an intelligent choice between U+FEFF (ZWNBSP) (better
U+2060 WJ) and U+200B (ZWSP).  

> I still think an “idle”/“null tag”/“noop”  character would be a neat 
> addition to Unicode, but I doubt I can make a convincing enough case 
> for it.

You'd still only be able to insert it between characters, not between code 
units, unless you were using UTF-32.

Richard.




RE: Unicode "no-op" Character?

2019-06-22 Thread Shawn Steele via Unicode
+ the list.  For some reason the list's reply header is confusing.

From: Shawn Steele
Sent: Saturday, June 22, 2019 4:55 PM
To: Sławomir Osipiuk 
Subject: RE: Unicode "no-op" Character?

The original comment about putting it between the base character and the 
combining diacritic seems peculiar.  I'm having a hard time visualizing how 
that kind of markup could be interesting?

From: Unicode <unicode-boun...@unicode.org> 
On Behalf Of Slawomir Osipiuk via Unicode
Sent: Saturday, June 22, 2019 2:02 PM
To: unicode@unicode.org
Subject: RE: Unicode "no-op" Character?

I see there is no such character, which I pretty much expected after Google 
didn't help.

The original problem I had was solved long ago but the recent article about 
watermarking reminded me of it, and my question was mostly out of curiosity. 
The task wasn't, strictly speaking, about "padding", but about marking - 
injecting "flag" characters at arbitrary points in a string without affecting 
the resulting visible text. I think we ended up using ESC, which is a dumb 
choice in retrospect, though the whole approach was a bit of a hack anyway and 
the process it was for isn't being used anymore.


RE: Unicode "no-op" Character?

2019-06-21 Thread Shawn Steele via Unicode
I'm curious what you'd use it for?

From: Unicode  On Behalf Of Slawomir Osipiuk via 
Unicode
Sent: Friday, June 21, 2019 5:14 PM
To: unicode@unicode.org
Subject: Unicode "no-op" Character?

Does Unicode include a character that does nothing at all? I'm talking about 
something that can be used for padding data without affecting interpretation of 
other characters, including combining chars and ligatures. I.e. a character 
that could hypothetically be inserted between a latin E and a combining acute 
and still produce É. The historical description of U+0016 SYNCHRONOUS IDLE 
seems like pretty much exactly what I want. It only has one slight 
disadvantage: it doesn't work. All software I've tried displays it as an 
unknown character and it definitely breaks up combinations. And U+0000 NULL 
seems even worse.
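A quick way to see the "breaks up combinations" effect (a sketch only; it just
shows that canonical composition does not reach across an intervening control
character such as U+0016):

    import unicodedata

    composed = unicodedata.normalize("NFC", "E\u0301")      # 'É': E + combining acute compose
    blocked  = unicodedata.normalize("NFC", "E\x16\u0301")  # stays as three code points
    print(len(composed), len(blocked))                      # 1 3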

I can imagine the answer is that this thing I'm looking for isn't a character 
at all and so should be the business of "a higher-level protocol" and not what 
Unicode was made for... but Unicode does include some odd things so I wonder if 
there is something like that regardless. Can anyone offer any suggestions?

Sławomir Osipiuk


RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
>> If they are obsolete apps, they don’t use CLDR / ICU, as these are designed 
>> for up-to-date and fully localized apps. So one hassle is off the table.

Windows uses CLDR/ICU.  Obsolete apps run on Windows.  That statement is a 
little narrow-minded.

>> I didn’t look into these date interchanges but I suspect they won’t use any 
>> thousands separator at all to interchange data.

Nope

>> The group separator is only for display and print

Yup, and people do the wrong thing so often that I even blogged about it. 
https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/

>> Sorry you did skip this one:

Oops, I did mean to respond to that one and accidentally skipped it.

>> What are all these expected to do while localized with scripts outside 
>> Windows code pages?

(We call those “unicode-only” locales FWIW)

The users that are not supported by legacy apps can’t use those apps 
(obviously).  And folks are strongly encouraged to write apps (and protocols) 
that Use Unicode (I’ve blogged about that too).  However, the fact that an app 
may run very poorly in Cherokee or whatever doesn’t mean that there aren’t a 
bunch of French enterprises that depend on that app for their day-to-day 
business.

In order for the “unicode-only” locale users to use those apps, the app would 
need to be updated, or another app with the appropriate functionality would 
need to be selected.

However, that still doesn’t impact the current French users that are “ok” with 
their current non-Unicode app.  Yes, I would encourage them to move to Unicode, 
however they tend to not want to invest in migration when they don’t see an 
urgent need.

Since Windows depends on CLDR and ICU data, updates to that data means that 
those customers can experience pain when trying to upgrade to newer versions of 
Windows.  We get those support calls, they don’t tend to pester CLDR.

Which is why I suggested an “opt-in” alt form that apps wanting “civilized” 
behavior could opt into (at least until enough badly behaved apps have been 
updated to warrant making that the default).

The data for locales like French tends to have been very stable for decades.  
Changes to data for major locales like that are more disruptive than changes 
for newer emerging markets where the data is undergoing more churn.

-Shawn



RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
>> Keeping these applications outdated has no other benefit than providing a 
>> handy lobbying tool against support of NNBSP.

I believe you’ll find that there are some French banks and other institutions 
that depend on such obsolete applications (unfortunately).

Additionally, I believe you’ll find that there are many scenarios where older 
applications and newer applications need to exchange data, either across the 
network, the web, or even on the same machine.  One app expecting NNBSP and 
another expecting NBSP on the same machine will likely lead to confusion.

This could be something like a “new” app running with the latest & greatest 
locale data trying to import the legacy data users had saved with an older app, 
or exchanging data with an application using the system settings, which are 
perhaps older.

>> Also when you need those apps, just tailor your French accordingly.

Having the user attempt to “correct” their settings may not be sufficient to 
resolve these discrepancies because not all applications or frameworks properly 
consider the user overrides on all platforms.

>> That should not impact all other users out there interested in a civilized 
>> layout.

I’m not sure that the choice of the word “civilized” adds value to the 
conversation.  We have pretty much zero feedback that the OS’s French 
formatting is “uncivilized” or that the NNBSP is required for correct support.

>> As long as SegoeUI has NNBSP support, no worries, that’s what CLDR data is 
>> for.

For compatibility, I’d actually much prefer that CLDR have an alt “best 
practice” field that maintained the existing U+00A0 behavior for compatibility, 
yet allowed applications wanting the newer typographic experience to opt in to 
the “best practice” alternative data.  As applications became used to the idea 
of an alternative for U+00A0, then maybe that could be flip-flopped and U+00A0 
put into a “legacy” alt form in a few years.

Normally I’m all for having the “best” data in CLDR, and there are many locales 
that have data with limited support for whatever reasons.  U+00A0 is pretty 
exceptional in my view, though: developers have been hard-coding dependencies 
on that value for ½ a century without even realizing there might be other types 
of non-breaking space.  Sure, that’s not really best practice, particularly in 
modern computing, but I suspect you’ll still find it taught in CS classes with 
little regard to things like NNBSP.
-Shawn



RE: NNBSP

2019-01-18 Thread Shawn Steele via Unicode
I've been lurking on this thread a little.

This discussion has gone “all over the place”; however, I’d like to point out 
that part of the reason NBSP has been used for thousands separators is that it 
exists in all of those legacy codepages that were mentioned predating Unicode.

Whether or not NNBSP provides a better typographical experience, there are a 
lot of legacy applications, and even web services, that depend on legacy 
codepages.  NNBSP may be best for layout, but I doubt that making it work 
perfectly for thousands separators is going to be some sort of magic bullet 
that solves the problems attributed to NBSP.

If folks started always using NNBSP, there are a lot of legacy applications 
that are going to start giving you ? in the middle of your numbers. 

Here’s a partial “dir > out.txt” after changing my number thousands separator 
to NNBSP in French on Windows (for example).
13/01/2019  09:48    15?360 AcXtrnal.dll
13/01/2019  09:46    54?784 AdaptiveCards.dll
13/01/2019  09:46    67?584 AddressParser.dll
13/01/2019  09:47    24?064 adhapi.dll
13/01/2019  09:47    97?792 adhsvc.dll
10/04/2013  08:32   154?624 AdjustCalendarDate.exe
10/04/2013  08:32 1?190?912 AdjustCalendarDate.pdb
13/01/2019  10:47   534?016 AdmTmpl.dll
13/01/2019  09:4858?368 adprovider.dll
13/01/2019  10:47   136?704 adrclient.dll
13/01/2019  09:48   248?832 adsldp.dll
13/01/2019  09:46   251?392 adsldpc.dll
13/01/2019  09:48   101?376 adsmsext.dll
13/01/2019  09:48   350?208 adsnt.dll
13/01/2019  09:46   849?920 adtschema.dll
13/01/2019  09:45   146?944 AdvancedEmojiDS.dll

There are lots of web services that still don’t expect UTF-8 (I know, bad on 
them), and many legacy applications that don’t have proper UTF-8 or Unicode 
support (I know, they should be updated).  It doesn’t seem to me that changing 
the French thousands separator to NNBSP solves all of the perceived problems.
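A hedged illustration of the legacy-codepage issue (cp1252 is just one example
of a legacy codepage; NBSP has a mapping there, NNBSP does not):

    nnbsp_number = "15\u202f360"    # NARROW NO-BREAK SPACE as the group separator
    nbsp_number  = "15\u00a0360"    # NO-BREAK SPACE as the group separator

    print(nnbsp_number.encode("cp1252", errors="replace"))  # b'15?360'    -- NNBSP becomes '?'
    print(nbsp_number.encode("cp1252"))                     # b'15\xa0360' -- NBSP round-trips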

-Shawn

 
http://blogs.msdn.com/shawnste



RE: Why so much emoji nonsense? - Proscription

2018-02-15 Thread Shawn Steele via Unicode
Depends on your perspective I guess ;)

-Original Message-
From: Unicode <unicode-boun...@unicode.org> On Behalf Of Richard Wordingham via 
Unicode
Sent: Thursday, February 15, 2018 2:31 PM
To: unicode@unicode.org
Subject: Re: Why so much emoji nonsense? - Proscription

On Thu, 15 Feb 2018 21:38:19 +0000
Shawn Steele via Unicode <unicode@unicode.org> wrote:

> I realize "I'd've" isn't
> "right",

Where did that proscription come from?  Is it perhaps a perversion of the 
proscription of "I'd of"?

Richard.



RE: Why so much emoji nonsense?

2018-02-15 Thread Shawn Steele via Unicode
For voice we certainly get clues about the speaker's intent from their tone.  
That tone can change the meaning of the same written word quite a bit.  There 
is no need for video for two different readings of the exact same words to have 
wildly different meanings.

Writers have always taken liberties with the written word to convey ideas that 
aren't purely grammatically correct.  This may be most obvious in poetry, but 
it happens even in other writings.  Maybe their entire reason was so that 
future English teachers would ask us why some author chose some peculiar 
structure or whatever.

I find it odd that I write things like "I'd've thought" (AFAIK I hadn't been 
exposed to I'd've and it just spontaneously occurred, but apparently others 
(mis)use it as well).  I realize "I'd've" isn't "right", but it better conveys 
my current state of mind than spelling it out would've.  Similarly, if I find 
myself smiling internally while I'm writing, it's going to get a :)

Though I may use :), I agree that most of my use of emoji is more decorative, 
however including other emoji can also make the sentence feel more "fun".  

If I receive a  as the only response to a comment I made, that conveys 
information that I would have a difficult time putting into words.

I don't find emoji to necessarily be a "post-literate" thing.  Just a different 
way of communicating.  I have also seen them used in a "pre-literate" fashion.  
Helping people that were struggling to learn to read get past the initial 
difficulties they were having on their way to becoming more literate.

-Shawn

-Original Message-
From: Unicode  On Behalf Of James Kass via Unicode
Sent: Thursday, February 15, 2018 12:53 PM
To: Ken Whistler 
Cc: Erik Pedersen ; Unicode Public 
Subject: Re: Why so much emoji nonsense?

Ken Whistler replied to Erik Pedersen,

> Emoticons were invented, in large part, to fill another major hole in 
> written communication -- the need to convey emotional state and 
> affective attitudes towards the text.

There is no such need.  If one can't string words together which 'speak for 
themselves', there are other media.  I suspect that emoticons were invented for 
much the same reason that "typewriter art"
was invented:  because it's there, it's cute, it's clever, and it's novel.

> This is the kind of information that face-to-face communication has a 
> huge and evolutionarily deep bandwidth for, but which written 
> communication typically fails miserably at.

Does Braille include emoji?  Are there tonal emoticons available for telephone 
or voice transmission?  Does the telephone "fail miserably"
at oral communication because there's no video to transmit facial tics and hand 
gestures?  Did Pontius Pilate have a cousin named Otto?
These are rhetorical questions.

For me, the emoji are a symptom of our moving into a post-literate age.  We 
already have people in positions of power who pride themselves on their 
marginal literacy and boast about the fact that they don't read much.  Sad!



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
But those are IETF definitions.  They don’t have to mean the same thing in 
Unicode - except that people working in this field probably expect them to.

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag 
via Unicode
Sent: Thursday, June 1, 2017 11:44 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote:

I think that the (or a) key problem is that the current "best practice" is 
treated as "SHOULD" in RFC parlance.  When what this really needs is a "MAY".



People reading standards tend to treat "SHOULD" and "MUST" as the same thing.

It's not that they "tend to", it's in RFC 2119:
SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.


The clear inference is that while the non-recommended practice is not 
prohibited, you better have some valid reason why you are deviating from it 
(and, reading between the lines, it would not hurt if you documented those 
reasons).



 So, when an implementation deviates, then you get bugs (as we see here).  
Given that there are very valid engineering reasons why someone might want to 
choose a different behavior for their needs - without harming the intent of the 
standard at all in most cases - I think the current/proposed language is too 
"strong".

Yes and no. ICU would be perfectly fine deviating from the existing 
recommendation and stating their engineering reasons for doing so. That would 
allow them to close their bug ("by documentation").

What's not OK is to take an existing recommendation and change it to something 
else, just to make bug reports go away for one implementation. That's like two 
sleepers fighting over a blanket that's too short. Whenever one is covered, the 
other is exposed.

If it is discovered that the existing recommendation is not based on anything 
like truly better behavior, there may be a case to change it to something 
that's equivalent to a MAY. Perhaps a list of nearly equally capable options.

(If that language is not in the standard already, a strong "an implementation 
MUST not depend on the use of a particular strategy for replacement of invalid 
code sequences", clearly ought to be added).

A./

-Shawn



-Original Message-

From: Alastair Houghton [mailto:alast...@alastairs-place.net]

Sent: Thursday, June 1, 2017 4:05 AM

To: Henri Sivonen <hsivo...@hsivonen.fi>

Cc: unicode Unicode Discussion <unicode@unicode.org>; Shawn Steele 
<shawn.ste...@microsoft.com>

Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8



On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode <unicode@unicode.org> wrote:



On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode 
<unicode@unicode.org> wrote:

* As far as I can tell, there are two (maybe three) sane approaches to this 
problem:

   * Either a "maximal" emission of one U+FFFD for every byte that exists 
outside of a good sequence

   * Or a "minimal" version that presumes the lead byte was counting trail 
bytes correctly even if the resulting sequence was invalid.  In that case just 
use one U+FFFD.

   * And (maybe, I haven't heard folks arguing for this one) emit one 
U+FFFD at the first garbage byte and then ignore the input until valid data 
starts showing up again.  (So you could have 1 U+FFFD for a string of a hundred 
garbage bytes as long as there weren't any valid sequences within that group).



I think it's not useful to come up with new rules in the abstract.



The first two aren’t “new” rules; they’re, respectively, the current “Best 
Practice”, the proposed “Best Practice” and one other potentially reasonable 
approach that might make sense e.g. if the problem you’re worrying about is 
serial data slip or corruption of a compressed or encrypted file (where 
corruption will occur until re-synchronisation happens, and as a result you 
wouldn’t expect to have any knowledge whatever of the number of characters 
represented in the data in question).



All of these approaches are explicitly allowed by the standard at present.  All 
three are reasonable, and each has its own pros and cons in a technical sense 
(leaving aside how prevalent the approach in question might be).  In a general 
purpose library I’d probably go for the second one; if I knew I was dealing 
with a potentially corrupt compressed or encrypted stream,

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
I think that the (or a) key problem is that the current "best practice" is 
treated as "SHOULD" in RFC parlance.  When what this really needs is a "MAY".

People reading standards tend to treat "SHOULD" and "MUST" as the same thing.  
So, when an implementation deviates, then you get bugs (as we see here).  Given 
that there are very valid engineering reasons why someone might want to choose 
a different behavior for their needs - without harming the intent of the 
standard at all in most cases - I think the current/proposed language is too 
"strong".

-Shawn

-Original Message-
From: Alastair Houghton [mailto:alast...@alastairs-place.net] 
Sent: Thursday, June 1, 2017 4:05 AM
To: Henri Sivonen <hsivo...@hsivonen.fi>
Cc: unicode Unicode Discussion <unicode@unicode.org>; Shawn Steele 
<shawn.ste...@microsoft.com>
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode <unicode@unicode.org> wrote:
> 
> On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode 
> <unicode@unicode.org> wrote:
>> * As far as I can tell, there are two (maybe three) sane approaches to this 
>> problem:
>>* Either a "maximal" emission of one U+FFFD for every byte that 
>> exists outside of a good sequence
>>* Or a "minimal" version that presumes the lead byte was counting 
>> trail bytes correctly even if the resulting sequence was invalid.  In that 
>> case just use one U+FFFD.
>>* And (maybe, I haven't heard folks arguing for this one) emit one 
>> U+FFFD at the first garbage byte and then ignore the input until valid data 
>> starts showing up again.  (So you could have 1 U+FFFD for a string of a 
>> hundred garbage bytes as long as there weren't any valid sequences within 
>> that group).
> 
> I think it's not useful to come up with new rules in the abstract.

The first two aren’t “new” rules; they’re, respectively, the current “Best 
Practice”, the proposed “Best Practice” and one other potentially reasonable 
approach that might make sense e.g. if the problem you’re worrying about is 
serial data slip or corruption of a compressed or encrypted file (where 
corruption will occur until re-synchronisation happens, and as a result you 
wouldn’t expect to have any knowledge whatever of the number of characters 
represented in the data in question).

All of these approaches are explicitly allowed by the standard at present.  All 
three are reasonable, and each has its own pros and cons in a technical sense 
(leaving aside how prevalent the approach in question might be).  In a general 
purpose library I’d probably go for the second one; if I knew I was dealing 
with a potentially corrupt compressed or encrypted stream, I might well plump 
for the third.  I can even *imagine* there being circumstances under which I 
might choose the first for some reason, in spite of my preference for the 
second approach.

I don’t think it makes sense to standardise on *one* of these approaches, so if 
what you’re saying is that the “Best Practice” has been treated as if it was 
part of the specification (and I think that *is* essentially your claim), then 
I’m in favour of either removing it completely, or (better) replacing it with 
Shawn’s suggestion - i.e. listing three reasonable approaches and telling 
developers to document which they take and why.

Kind regards,

Alastair.

--
http://alastairs-place.net




RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> And *that* is what the specification says.  The whole problem here is that 
> someone elevated
> one choice to the status of “best practice”, and it’s a choice that some of 
> us don’t think *should*
> be considered best practice.

> Perhaps “best practice” should simply be altered to say that you *clearly 
> document* your behavior
> in the case of invalid UTF-8 sequences, and that code should not rely on the 
> number of U+FFFDs 
> generated, rather than suggesting a behaviour?

That's what I've been suggesting.

I think we could maybe go a little further though:

* Best practice is clearly not to depend on the # of U+FFFDs generated by 
another component/app.  Clearly that can't be relied upon, so I think everyone 
can agree with that.
* I think encouraging documentation of behavior is cool, though there are 
probably low priority bugs and people don't like to read the docs in that 
detail, so I wouldn't expect very much from that.
* As far as I can tell, there are two (maybe three) sane approaches to this 
problem (see the sketch below):
    * Either a "maximal" emission of one U+FFFD for every byte that exists 
outside of a good sequence 
    * Or a "minimal" version that presumes the lead byte was counting trail 
bytes correctly even if the resulting sequence was invalid.  In that case just 
use one U+FFFD.
    * And (maybe, I haven't heard folks arguing for this one) emit one 
U+FFFD at the first garbage byte and then ignore the input until valid data 
starts showing up again.  (So you could have 1 U+FFFD for a string of a hundred 
garbage bytes as long as there weren't any valid sequences within that group.)
* I'd be happy if the best practice encouraged one of those two (or maybe 
three) approaches.  I think an approach that called rand() to see how many 
U+FFFDs to emit when it encountered bad data is fair to discourage.
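A rough sketch of how the first two options differ on one concrete input (an
illustration only, not any particular library's behavior):

    # Overlong/invalid three-byte sequence followed by an ASCII letter.
    data = b"\xe0\x80\x80" + b"A"

    # "Maximal": E0 cannot legally be followed by 80, so each byte of the bad
    # sequence gets its own replacement character:   '\ufffd\ufffd\ufffdA'
    # "Minimal": E0 announces a three-byte sequence, so the whole (invalid)
    # E0 80 80 is consumed as one unit:               '\ufffdA'

    # For reference, CPython's decoder currently takes the first approach:
    print(data.decode("utf-8", errors="replace"))   # '\ufffd\ufffd\ufffdA'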

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> it’s more meaningful for whoever sees the output to see a single U+FFFD 
> representing 
> the illegally encoded NUL that it is to see two U+FFFDs, one for an invalid 
> lead byte and 
> then another for an “unexpected” trailing byte.

I disagree.  It may be more meaningful for some applications to have a single 
U+FFFD representing an illegally encoded 2-byte NULL than to have 2 U+FFFDs.  
Of course then you don't know if it was an illegally encoded 2-byte NULL or an 
illegally encoded 3-byte NULL or whatever, so some information that other 
applications may be interested in is lost.

Personally, I prefer the "emit a U+FFFD if the sequence is invalid, drop the 
byte, and try again" approach.  

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> For implementations that emit FFFD while handling text conversion and repair 
> (ie, converting ill-formed
> UTF-8 to well-formed), it is best for interoperability if they get the same 
> results, so that indices within the
> resulting strings are consistent across implementations for all the correct 
> characters thereafter.

That seems optimistic :)

If interoperability is the goal, then it would seem to me that changing the 
recommendation would be contrary to that goal.  There are systems that will not 
or cannot change to a new recommendation.  If such systems are updated, then 
adoption of those systems will likely take some time.

In other words, I cannot see where “consistency across implementations” would 
be achievable anytime in the near future.

It seems to me that if the goal is being able to use a data stream of ambiguous 
quality in another application with predictable results, then that stream 
should be “repaired” prior to being handed over.  Then both endpoints would be 
using the same set of FFFDs, whether that was single or multiple forms.


-Shawn


RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> > In either case, the bad characters are garbage, so neither approach is 
> > "better" - except that one or the other may be more conducive to the 
> > requirements of the particular API/application.

> There's a potential issue with input methods that indirectly edit the backing 
> store.  For example,
> GTK input methods (e.g. function gtk_im_context_delete_surrounding()) can 
> delete an amount 
> of text specified in characters, not storage units.  (Deletion by storage 
> units is not available in this
> interface.)  This might cause utter confusion or worse if the backing store 
> starts out corrupt. 
> A corrupt backing store is normally manually correctable if most of the text 
> is ASCII.

I think that's sort of what I said: some approaches might work better for some 
systems and another approach might work better for another system.  This also 
presupposes a corrupt store.

It is unclear to me what the expected behavior would be for this corruption 
if, for example, there were merely a half dozen 0x80 bytes in the middle of 
ASCII text.  Is that garbage a single "character"?  Perhaps because it's a 
consecutive string of bad bytes?  Or should it be 6 characters since they're 
nonsense?  Or maybe 2 characters because the maximum # of trail bytes we can 
have is 3?
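For what it's worth, one widely used decoder (CPython's) answers that question
today by emitting one U+FFFD per lone 0x80 byte; shown only as an illustration
of one possible choice, not as the "right" answer:

    raw = b"AB" + b"\x80" * 6 + b"CD"

    print(raw.decode("utf-8", errors="replace"))
    # 'AB\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdCD' -- six replacement characters.
    # A decoder that collapsed the whole run would give 'AB\ufffdCD' instead;
    # once the U+FFFDs are emitted, neither count is recoverable from the other.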

What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes?

I can see how different implementations might be able to come up with "rules" 
that would help them navigate (or clean up) those minefields, however it is not 
at all clear to me that there is a "best practice" for those situations.

There also appears to be a special weight given to non-minimally-encoded 
sequences.  It would seem to me that none of these illegal sequences should 
appear in practice, so we have either:

* A bad encoder spewing out garbage (overlong sequences)
* Flipped bit(s) due to storage/transmission/whatever errors
* Lost byte(s) due to storage/transmission/coding/whatever errors
* Extra byte(s) due to whatever errors
* Bad string manipulation breaking/concatenating in the middle of sequences, 
causing garbage (perhaps one of the above 2 coding errors).

Only in the first case, of a bad encoder, are the overlong sequences actually 
"real".  And that shouldn't happen (it's a bad encoder after all).  The other 
scenarios seem just as likely (or, IMO, much more likely) as a badly designed 
encoder creating overlong sequences that appear to fit the UTF-8 pattern but 
aren't actually UTF-8.

The other cases are going to cause byte patterns that are less "obvious" about 
how they should be navigated for various applications.

I do not understand the energy being invested in a case that shouldn't happen, 
especially in a case that is a subset of all the other bad cases that could 
happen.

-Shawn 



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence <C0 AF> 
> as U+002F.

Sort of, maybe.  It was not legal for them to generate it though.  So you could 
kind of infer that it was not a legal sequence.
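A small illustration of why that sequence matters (C0 AF is the classic
two-byte overlong form of '/'; the output shown is CPython's current behavior,
used here only as an example of a strict decoder):

    overlong_slash = b"\xc0\xaf"

    print(overlong_slash.decode("utf-8", errors="replace"))   # '\ufffd\ufffd', never '/'
    # So the old "smuggle a slash past a path filter" trick no longer works.
    # Under the proposed "treat the whole overlong sequence as one unit"
    # practice, the same input would instead yield a single U+FFFD.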

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> Which is to completely reverse the current recommendation in Unicode 9.0. 
> While I agree that this might help you fending off a bug report, it would 
> create chances for bug reports for Ruby, Python3, many if not all Web 
> browsers,...

& Windows & .Net

Changing the behavior of the Windows / .Net SDK is a non-starter.

> Essentially, "overlong" is a word like "dragon" or "ghost": Everybody knows 
> what it means, but everybody knows they don't exist.

Yes, this is trying to improve the language for a scenario that CANNOT HAPPEN.  
We're trying to optimize a case for data that implementations should never 
encounter.  It is sort of exactly like optimizing for the case where your data 
input is actually a dragon and not UTF-8 text.  

Since it is illegal, the "at least 1 FFFD but as many as you want to emit 
(or just fail)" approach is fine.

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> I think nobody is debating that this is *one way* to do things, and that some 
> code does it.

Except that they sort of are.  The premise is that the "old language was 
wrong", and the "new language is right."  The reason we know the old language 
was wrong was that there was a bug filed against an implementation because it 
did not conform to the old language.  The response to the application bug was 
to change the standard's recommendation.

If this language is adopted, then the opposite is going to happen:  Bugs will 
be filed against applications that conform to the old recommendation and not 
the new recommendation.  They will say "your code could be better, it is not 
following the recommendation."  Eventually that will escalate to some level 
that it will need to be considered, however, regardless of the improvements, it 
will be a "breaking change".

Changing code from one recommendation to another will change behavior.  For 
applications or SDKs with enough visibility, that will break *someone* because 
that's how these things work.  For applications that choose not to change, in 
response to some RFP, someone's going to say "you don't fully conform to 
Unicode, we'll go with a different vendor."  Not saying that these things make 
sense, that's just the way the world works.

In some situations, one form is better, in some cases another form is better.  
If the intent is truly that there is not "one way to do things," then the 
language should reflect that.

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Shawn Steele via Unicode
So basically this came about because code got bugged for not following the 
"recommendation."  To fix that, the recommendation will be changed.  However, 
that is then going to lead to bugs for other existing code that does not follow 
the new recommendation.

I totally get the reasoning about keeping forward/backward scanning in sync 
without decoding for some implementations; however, I do not think that the 
practices that benefit those should extend to other applications that are happy 
with a different practice.

In either case, the bad characters are garbage, so neither approach is "better" 
- except that one or the other may be more conducive to the requirements of the 
particular API/application.

I really think the correct approach here is to allow any number of replacement 
characters without prejudice.  Perhaps with suggestions for pros and cons of 
various approaches if people feel that is really necessary.

-Shawn

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Karl Williamson 
via Unicode
Sent: Friday, May 26, 2017 2:16 PM
To: Ken Whistler 
Cc: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 05/26/2017 12:22 PM, Ken Whistler wrote:
> 
> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
>> The link provided about the PRI doesn't lead to the comments.
>>
> 
> PRI #121 (August, 2008) pre-dated the practice of keeping all the 
> feedback comments together with the PRI itself in a numbered directory 
> with the name "feedback.html". But the comments were collected 
> together at the time and are accessible here:
> 
> http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121
> 
> Also there was a separately submitted comment document:
> 
> http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt
> 
> And the minutes of the pertinent UTC meeting (UTC #116):
> 
> http://www.unicode.org/L2/L2008/08253.htm
> 
> The minutes simply capture the consensus to adopt Option #2 from PRI 
> #121, and the relevant action items.
> 
> I now return the floor to the distinguished disputants to continue 
> litigating history. ;-)
> 
> --Ken
> 
>

The reason this discussion got started was that in December, someone came to me 
and said the code I support does not follow Unicode best practices, and 
suggested I need to change, though no ticket (yet) has been filed.  I was 
surprised, and posted a query to this list about what the advantages of the new 
approach are.  There were a number of replies, but I did not see anything that 
seemed definitive.  After a month, I created a ticket in Unicode and Markus was 
assigned to research it, and came up with the proposal currently being debated.

Looking at the PRI, it seems to me that treating an overlong as a single 
maximal unit is in the spirit of the wording, if not the fine print. 
That seems to be borne out by Markus, even with his stake in ICU, supporting 
option #2.

Looking at the comments, I don't see any discussion of the effect of this on 
overlong treatments.  My guess is that the effect change was unintentional.

So I have code that handled overlongs in the only correct way possible when 
they were acceptable, and in the obvious way after they became illegal, and now 
without apparent discussion (which is very much akin to "flimsy reasons"), it 
suddenly was no longer "best practice".  And that change came "rather late in 
the game".  That this escaped notice for years indicates that the specifics of 
REPLACEMENT CHAR handling don't matter all that much.

To cut to the chase, I think Unicode should issue a Corrigendum to the effect 
that it was never the intent of this change to say that treating overlongs as a 
single unit isn't best practice.  I'm not sure this warrants a full-fledged 
Corrigendum, though.  But I believe the text of the best practices should 
indicate that treating overlongs as a single unit is just as acceptable as 
Martin's interpretation.

I believe this is pretty much in line with Shawn's position.  Certainly, a 
discussion of the reasons one might choose one interpretation over another 
should be included in TUS.  That would likely have satisfied my original query, 
which hence would never have been posted.



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Shawn Steele via Unicode
> If the thread has made one thing clear is that there's no consensus in the 
> wider community
> that one approach is obviously better. When it comes to ill-formed sequences, 
> all bets are off.
> Simple as that.

> Adding a "recommendation" this late in the game is just bad standards policy.

I agree.  I'm not sure what value this provides.  If someone thought it added 
value to discuss the pros and cons of implementing it one way and the other as 
MAY do this or MAY do that, I don't mind.  But I think both should be 
permitted, and neither should be encouraged with anything stronger than a MAY.

-Shawn




RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Shawn Steele via Unicode
+ the list, which somehow my reply seems to have lost.

> I may have missed something, but I think nobody actually proposed to change 
> the recommendations into requirements

No thanks, that would be a breaking change for some implementations (like mine) 
and force them to become non-complying or potentially break customer behavior.

I would prefer that both options be permitted, perhaps with a few words of 
advantages.

-Shawn




RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
> Faster ok, provided this does not break other uses, notably for random 
> access within strings…

Either way, this is a “recommendation”.  I don’t see how that can provide for 
not-“breaking other uses.”  If it’s internal, you can do what you will, so if 
you need the 1:1 seeming parity, then you can do that internally.  But if 
you’re depending on other APIs/libraries/data source/whatever, it would seem 
like you couldn’t count on that.  (And probably shouldn’t even if it was a 
requirement rather than a recommendation).

I’m wary of the idea of attempting random access on a stream that is also 
being manipulated at the same time (decoded, apparently).

The U+FFFD emitted by this decoding could also require a different # of bytes 
to reencode.  Which might disrupt the presumed parity, depending on how the 
data access was being handled.
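A one-line illustration of that point (a sketch; any decoder that substitutes
U+FFFD for a single bad byte shows the same effect):

    bad = b"abc\x80def"                               # 7 bytes in
    repaired = bad.decode("utf-8", errors="replace")  # 'abc\ufffddef'
    print(len(bad), len(repaired.encode("utf-8")))    # 7 9 -- U+FFFD re-encodes as 3 bytes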

-Shawn


RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
But why change a recommendation just because it “feels like”.  As you said, 
it’s just a recommendation, so if that really annoyed someone, they could do 
something else (eg: they could use a single FFFD).

If the recommendation is truly that meaningless or arbitrary, then we just get 
into silly discussions of “better” that nobody can really answer.

Alternatively, how about “one or more FFFDs?” for the recommendation?

To me it feels very odd to perhaps require writing extra code to detect an 
illegal case.  The “best practice” here should maybe be “one or more FFFDs, 
whatever makes your code faster”.

Best practices may not be requirements, but people will still take time to file 
bugs that something isn’t following a “best practice”.

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Markus Scherer 
via Unicode
Sent: Tuesday, May 16, 2017 11:37 AM
To: Alastair Houghton 
Cc: Philippe Verdy ; Henri Sivonen ; 
unicode Unicode Discussion ; Hans Åberg 

Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

Let me try to address some of the issues raised here.

The proposal changes a recommendation, not a requirement. Conformance applies 
to finding and interpreting valid sequences properly. This includes not 
consuming parts of valid sequences when dealing with illegal ones, as explained 
in the section "Constraints on Conversion Processes".

Otherwise, what you do with illegal sequences is a matter of what you think 
makes sense -- a matter of opinion and convenience. Nothing more.

I wrote my first UTF-8 handling code some 18 years ago, before joining the ICU 
team. At the time, I believe the ISO UTF-8 definition was not yet limited to 
U+10, and decoding overlong sequences and those yielding surrogate code 
points was regarded as a misdemeanor. The spec has been tightened up, but I am 
pretty sure that most people familiar with how UTF-8 came about would recognize 
 and  as single sequences.

I believe that the discussion of how to handle illegal sequences came out of 
security issues a few years ago from some implementations including valid 
single and lead bytes with preceding illegal sequences. Beyond the "Constraints 
on Conversion Processes", there was evidently also a desire to recommend how to 
handle illegal sequences.

I think that the current recommendation was an extrapolation of common practice 
for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for UTF-8, too, 
but "it feels like" (yes, that's the level of argument for stuff that doesn't 
really matter) not treating  and  as single sequences is 
"weird".

Why do we care how we carve up an illegal sequence into subsequences? Only for 
debugging and visual inspection. Maybe some process is using illegal, overlong 
sequences to encode something special (à la Java string serialization, 
"modified UTF-8"), and for that it might be convenient too to treat overlong 
sequences as single errors.

If you don't like some recommendation, then do something else. It does not 
matter. If you don't reject the whole input but instead choose to replace 
illegal sequences with something, then make sure the something is not nothing 
-- replacing with an empty string can cause security issues. Otherwise, what 
the something is, or how many of them you put in, is not very relevant. One or 
more U+FFFDs is customary.

When the current recommendation came in, I thought it was reasonable but didn't 
like the edge cases. At the time, I didn't think it was important to twiddle 
with the text in the standard, and I didn't care that ICU didn't exactly 
implement that particular recommendation.

I have seen implementations that clobber every byte in an illegal sequence with 
a space, because it's easier than writing a U+FFFD for each byte or for some 
subsequences. Fine. Someone might write a single U+FFFD for an arbitrarily long 
illegal subsequence; that's fine, too.

Karl Williamson sent feedback to the UTC, "In short, I believe the best 
practices are wrong." I think "wrong" is far too strong, but I got an action 
item to propose a change in the text. I proposed a modified recommendation. 
Nothing gets elevated to "right" that wasn't, nothing gets demoted to "wrong" 
that was "right".

None of this is motivated by which UTF is used internally.

It is true that it takes a tiny bit more thought and work to recognize a wider 
set of sequences, but a capable implementer will optimize successfully for 
valid sequences, and maybe even for a subset of those for what might be 
expected high-frequency code point ranges. Error handling can go into a slow 
path. In a true state table implementation, it will require more states but 
should not affect the performance of valid sequences.
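For concreteness, a small, unoptimized Python sketch (not ICU's code) of the carving the recommendation asks for: one U+FFFD per maximal subpart, using the lead-byte and continuation-byte ranges of Table 3-7. A per-byte policy would instead emit one U+FFFD for every byte it skips, and a per-illegal-subsequence policy would merge adjacent errors into one.

    def decode_with_fffd(data: bytes) -> str:
        # Emit one U+FFFD per "maximal subpart" of an ill-formed subsequence.
        out = []
        i, n = 0, len(data)
        while i < n:
            b = data[i]
            if b < 0x80:                             # ASCII
                out.append(chr(b)); i += 1; continue
            # Lead byte: number of continuation bytes and the allowed range
            # for the *first* continuation byte (per TUS Table 3-7).
            if   0xC2 <= b <= 0xDF: need, lo, hi = 1, 0x80, 0xBF
            elif b == 0xE0:         need, lo, hi = 2, 0xA0, 0xBF
            elif b == 0xED:         need, lo, hi = 2, 0x80, 0x9F
            elif 0xE1 <= b <= 0xEF: need, lo, hi = 2, 0x80, 0xBF
            elif b == 0xF0:         need, lo, hi = 3, 0x90, 0xBF
            elif b == 0xF4:         need, lo, hi = 3, 0x80, 0x8F
            elif 0xF1 <= b <= 0xF3: need, lo, hi = 3, 0x80, 0xBF
            else:                                    # C0, C1, F5..FF never start a sequence
                out.append('\ufffd'); i += 1; continue
            j, taken = i + 1, 0
            while taken < need and j < n:
                c = data[j]
                if not ((lo <= c <= hi) if taken == 0 else (0x80 <= c <= 0xBF)):
                    break
                j += 1; taken += 1
            if taken == need:
                out.append(data[i:j].decode('utf-8'))  # well-formed sequence
            else:
                out.append('\ufffd')                   # one U+FFFD for the maximal subpart data[i:j]
            i = j
        return ''.join(out)

    # decode_with_fffd(b'a\xe0\x80\x80z') == 'a\ufffd\ufffd\ufffdz'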

Many years ago, I decided for ICU to add a small amount of slow-path 
error-handling code for more 

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
Regardless, it's not legal and hasn't been legal for quite some time.  
Replacing a hacked embedded "null" with FFFD is going to be pretty breaking to 
anything depending on that fake-null, so one or three isn't really going to 
matter.

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard 
Wordingham via Unicode
Sent: Tuesday, May 16, 2017 10:58 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On Tue, 16 May 2017 17:30:01 +0000
Shawn Steele via Unicode <unicode@unicode.org> wrote:

> > Would you advocate replacing
> 
> >   e0 80 80
> 
> > with
> 
> >   U+FFFD U+FFFD U+FFFD (1)  
> 
> > rather than
> 
> >   U+FFFD   (2)  
> 
> > It’s pretty clear what the intent of the encoder was there, I’d say, 
> > and while we certainly don’t want to decode it as a NUL (that was 
> > the source of previous security bugs, as I recall), I also don’t see 
> > the logic in insisting that it must be decoded to *three* code 
> > points when it clearly only represented one in the input.
> 
> It is not at all clear what the intent of the encoder was - or even if 
> it's not just a problem with the data stream.  E0 80 80 is not 
> permitted, it's garbage.  An encoder can't "intend" it.

It was once a legal way of encoding NUL, just like C0 80, which is still in 
use, and seems to be the best way of storing NUL as character content in a *C 
string*.  (Strictly speaking, one can't do it.)  It could be lurking in old 
text or come from an old program that somehow doesn't get used for U+0080 to 
U+07FF. Converting everything in UCS-2 to 3 bytes was an easily encoded way of 
converting UTF-16 to UTF-8.
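The "everything to 3 bytes" shortcut Richard describes looks roughly like the sketch below (one plausible reading of that old hack, not code from this thread). It happily emits E0 80 80 for U+0000 and raw 3-byte surrogates for a UTF-16 pair -- exactly the byte sequences a strict UTF-8 decoder now rejects:

    def everything_to_3_bytes(utf16_units):
        # Encode each 16-bit code unit as exactly 3 bytes, UCS-2 style.
        out = bytearray()
        for u in utf16_units:
            out += bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
        return bytes(out)

    print(everything_to_3_bytes([0x0000]).hex())          # e08080 (overlong NUL)
    print(everything_to_3_bytes([0xD83D, 0xDE00]).hex())  # eda0bdedb880 (an encoded surrogate pair)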

Remember the conformance test for the Unicode Collation Algorithm has contained 
lone surrogates in the past, and the UAX on Unicode Regular Expressions used to 
require the ability to search for lone surrogates.

Richard.




RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
> Would you advocate replacing

>   e0 80 80

> with

>   U+FFFD U+FFFD U+FFFD (1)

> rather than

>   U+FFFD   (2)

> It’s pretty clear what the intent of the encoder was there, I’d say, and 
> while we certainly don’t 
> want to decode it as a NUL (that was the source of previous security bugs, as 
> I recall), I also don’t
> see the logic in insisting that it must be decoded to *three* code points 
> when it clearly only 
> represented one in the input.

It is not at all clear what the intent of the encoder was - or even if it's not 
just a problem with the data stream.  E0 80 80 is not permitted, it's garbage.  
An encoder can't "intend" it.

Either
A) the "encoder" was attempting to be malicious, in which case the whole thing 
is suspect and garbage, and so the # of FFFD's doesn't matter, or

B) the "encoder" is completely broken, in which case all bets are off, again, 
specifying the # of FFFD's is irrelevant.

C) The data was corrupted by some other means.  Perhaps bad concatenations, 
lost blocks during read/transmission, etc.  If we lost 2 512 byte blocks, then 
maybe we should have a thousand FFFDs (but how would we know?)

-Shawn



RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Shawn Steele via Unicode
>> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
>> multiple errors there makes no sense.
> 
> Changing a specification as fundamental as this is something that should not 
> be undertaken lightly.

IMO, the only thing that can be agreed upon is that "something's bad with this 
UTF-8 data".  I think that whether it's treated as a single group of corrupt 
bytes or each individual byte is considered a problem should be up to the 
implementation.

#1 - This data should "never happen".  In a system behaving normally, this 
condition should never be encountered.  
  * At this point the data is "bad" and all bets are off.
  * Some applications may have a clue how the bad data could have happened and 
want to do something in particular.
  * It seems odd to me to spend much effort standardizing a scenario that 
should be impossible.
#2 - Depending on implementation, either behavior, or some combination, may be 
more efficient.  I'd rather allow apps to optimize for the common case, not the 
case-that-shouldn't-ever-happen
#3 - We have no clue if this "maximal" sequence was a single error, 2 errors, 
or even more.  The lead byte says how many trail bytes should follow, and those 
should be in a certain range.  Values outside of those conditions are illegal, 
so we shouldn't ever encounter them.  So if we did, then something really weird 
happened.  
  * Did a single character get misencoded?
  * Was an illegal sequence illegally encoded?
  * Perhaps a byte got corrupted in transmission?
  * Maybe we dropped a packet/block, so this is really the beginning of a valid 
sequence and the tail of another completely valid sequence?

In practice, all that most apps would be able to do would be to say "You have 
bad data, how bad I have no clue, but it's not right".  A single bit could've 
flipped, or you could have only 3 pages of a 4000 page document.  No clue at 
all.  At that point it doesn't really matter how many FFFD's the error(s) are 
replaced with, and no assumptions should be made about the severity of the 
error.

-Shawn



RE: how would you state requirements involving sorting?

2017-01-24 Thread Shawn Steele
That requirement will probably really annoy speakers of some languages.

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Eric Muller
Sent: Monday, January 23, 2017 10:44 PM
To: unicode@unicode.org
Subject: how would you state requirements involving sorting?

Suppose you help somebody write requirements for a piece of software and you 
see an item:
Sorting. Diacritic marks need to be stripped when sorting titles

You know that sorting is a lot more complicated than removing diacritics, and 
that giving the directive above to a naive developer is going to lead to 
trouble. You know you want to end up with an implementation involving the UCA 
with a tailoring based on the locale. How would you suggest to reword the 
requirement?

Thanks,
Eric.


RE: The (Klingon) Empire Strikes Back

2016-11-10 Thread Shawn Steele
More generally, does that mean that alphabets with perceived owners will only 
be considered for encoding with permission from those owner(s)?  What if the 
ownership is ambiguous or unclear?

Getting permission may be a lot of work, or cost money, in some cases.  Will 
applications be considered pending permission, perhaps being provisionally 
approved until such permission is received?

Is there specific language that Unicode would require from owners to be 
comfortable in these cases?  It makes little sense for a submitter to go 
through a complex exercise to request permission if Unicode is not comfortable 
with the wording of the permission that is garnered.  Are there other such 
agreements that could perhaps be used as templates?

Historically, the message pIqaD supporters have heard from Unicode has been 
that pIqaD is a toy script that does not have enough use.  The new proposal 
attempts to respond to those concerns, particularly since there is more 
interest in the script now.  Now, additional (valid) concerns are being raised.

In Mark’s case it seems like it would be nice if Unicode could consider the 
rest of the proposal and either tentatively approve it pending Paramount’s 
approval, or provide feedback on other defects in the proposal that would 
need to be addressed for consideration.  Meanwhile Mark can figure out how to get 
Paramount’s agreement.

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Peter Constable
Sent: Wednesday, November 9, 2016 8:49 PM
To: Mark E. Shoulson ; David Faulks 
Cc: Unicode Mailing List 
Subject: RE: The (Klingon) Empire Strikes Back

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Mark E. Shoulson
Sent: Friday, November 4, 2016 1:18 PM
> At any rate, this isn't Unicode's problem…

You saying that potential IP issues are not Unicode’s problem does not in fact 
make it not a problem. A statement in writing from authorized Paramount 
representatives stating it would not be a problem for either Unicode, its 
members or implementers of Unicode would make it not a problem for Unicode.



Peter


RE: The (Klingon) Empire Strikes Back

2016-11-07 Thread Shawn Steele
I guess for this thread I should subscribe to the list with a personal email 
address.  Please don’t confuse my personal and professional opinions here ;)  
(Of course I’ll probably confuse them myself).





Personally, as myself, no Microsoft hat, I would be interested to see the base 
characters encoded, excluding the “mummification glyph” and your 2 created 
characters.  The mummification glyph seems decorative and I haven’t seen the 
others in use.  I would include the pIqaD comma and full stop, they seem to 
have fairly consistent use.  Their meaning is also more specific than the 
triangle glyph suggestions you mentioned as possible alternatives.  Since these 
are used in plaintext conversations and not merely as decoration, I think that 
attempting to overload the meaning of the non-pIqaD triangle glyphs would be 
inappropriate.

The enthusiasts using pIqaD, and the businesses targeting that community, have, 
in my opinion, reached a level of adoption that requires proper Unicode 
encoding to make further progress.  The current ConScript PUA practice is a 
decent hack to get things to work, but in practice there can be strange 
behaviors, particularly in more advanced aspects of character behavior.  Like 
the fact that the PUA range doesn’t properly describe the character properties 
of these letters and digits.

For example, Qurgh and others figured out how to get pIqaD to behave in 
Facebook posts.  The current Klingon word of the day posts include the pIqaD 
spelling, and some discussion happens in pIqaD as well.  However getting it all 
to behave is unnecessarily awkward given some of the current restrictions 
requiring using the PUA for pIqaD.

Mark, you missed that pIqaD has an ISO script code now (Piqd).  That might be 
worth mentioning.  The PUA encoding makes it difficult or hacky to integrate 
some features for the Piqd script in computing libraries, such as digit 
conversion routines.





Professionally, I’m not sure if Microsoft has a current position on pIqaD.  As 
noted by Mark, the Bing Translator allows the use of pIqaD (tlh-Piqd), both for 
input and output.  I chose to use the ConScript PUA for that feature.  Had the 
pIqaD script been included in Unicode, we would have used the assigned Unicode 
codepoints instead of the ConScript PUA.

-Shawn

 
http://blogs.msdn.com/shawnste
http://bb-8.blogspot.com

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Mark Shoulson
Sent: ,  03,  2016 16:44
To: unicode@unicode.org
Subject: The (Klingon) Empire Strikes Back


At the time of writing this letter it has not yet hit the UTC Document 
Register, but I have recently submitted a document revisiting the ever-popular 
issue of the encoding of Klingon "pIqaD".  The reason always given why it could 
not be encoded was that it did not enjoy enough usage, and so I've collected a 
bunch of examples to demonstrate that this is not true (scans and also web 
pages, etc.)  So the issue comes back up, and time to talk about it again.

Michael Everson: I basically copied your 1997 proposal into the document, with 
some minor changes.  I hope you don't mind.  And if you don't want to be on the 
hook for providing the glyphs to UTC, I can do that.  I think that proposal 
should serve as a starting-point for discussion anyway.  There are some things 
that maybe should be different:

1. the "SYMBOL FOR EMPIRE" also known as the "MUMMIFICATION GLYPH".  I don't 
know where the second name comes from, I don't know how important it is to 
encode it, and I don't know how much of a trademark headache it will cause with 
Paramount, as it is used pretty heavily in their imagery.  Something we'll have 
to talk about.

2. I put in the COMMA and FULL STOP, which were not in the original proposal 
but were in the ConScript registry entry.  The examples I have show them 
clearly being used.  UTC may decide to unify them with existing triangular 
shapes, which may or may not be a good idea.

3. For my part, I've invented a pair of ampersands for Klingon (Klingon has two 
words for "and": one for joining verbs/sentences and one for joining nouns (the 
former goes between its "conjunctands", the latter after them)), from ligatures 
of the letters in question.  They pretty much have NO usage, of course (and are 
not in the proposal), but maybe they should be presented to the community.

Document is available at http://web.meson.org/downloads/pIqaDReturns.pdf

Let the bickering begin!

~mark


RE: UTC makes the Colbert show

2016-03-30 Thread Shawn Steele
He has suggestions for process improvements as well…

From: Unicore [mailto:unicore-boun...@unicode.org] On Behalf Of Jennifer 8. Lee
Sent: Wednesday, March 30, 2016 10:37 AM
To: Mark Davis ☕️ 
Cc: UTC ; Unicode Public 
Subject: Re: UTC makes the Colbert show

He cites you by title!

On Wednesday, March 30, 2016, Mark Davis ☕️ 
> wrote:
Fredrik passed this on:
https://www.youtube.com/watch?v=CfZE56E0Uts ; skip ahead to 1:30.

Mark


RE: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

2016-01-31 Thread Shawn Steele
It should be understood that any algorithm that changes the Unicode character 
data to non-character data is therefore binary, and not Unicode.  It's 
inappropriate to shove binary data into unicode streams because stuff will 
break.
https://blogs.msdn.microsoft.com/shawnste/2005/09/26/avoid-treating-binary-data-as-a-string/


-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Chris Jacobs
Sent: Sunday, January 31, 2016 10:08 AM
To: J Decker 
Cc: unicode@unicode.org
Subject: Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers



J Decker schreef op 2016-01-31 18:56:
> On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs 
> wrote:
>> 
>> 
>> J Decker schreef op 2016-01-31 03:28:
>>> 
>>> I've reconsidered and think for ease of implementation to just mask 
>>> every UTF-16 character (not  codepoint) with a 10 bit value, This 
>>> will result in no character changing from BMP space to 
>>> surrogate-pair or vice-versa.
>>> 
>>> Thanks for the feedback.
>> 
>> 
>> So you are still trying to handle the unarmed output as plaintext.
>> Do you realize that if a string in the output is replaced by a 
>> canonical equivalent one this may mess up things because the 
>> originals are not canonical equivalent?
>> 
> I see ... things like mentioned here
> http://websec.github.io/unicode-security-guide/character-transformatio
> ns/

Yes especially the part about normalization.
This would not only spoil the normalized string, but also, as the string can 
have a different length, for anything after that your ever-changing xor-values 
may go out of sync.
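To pin down the 10-bit-mask idea in this exchange (a sketch; not J Decker's or Chris's actual code): XORing each UTF-16 code unit with a key of at most 0x3FF leaves the top six bits untouched, so BMP units stay BMP and each surrogate stays in its own half -- but, as noted above, any canonical-equivalence substitution applied to the masked text breaks the round trip:

    def xor_mask_utf16(units, key):
        assert 0 <= key <= 0x3FF          # 10-bit key: only the low 10 bits change
        return [u ^ key for u in units]

    units = [ord(c) for c in "Hi\u00e9"]               # UTF-16 code units of "Hié"
    masked = xor_mask_utf16(units, 0x2A7)
    assert xor_mask_utf16(masked, 0x2A7) == units      # XOR round-trips the raw units

    hi_s, lo_s = xor_mask_utf16([0xD83D, 0xDE00], 0x3FF)   # a masked surrogate pair
    assert 0xD800 <= hi_s <= 0xDBFF and 0xDC00 <= lo_s <= 0xDFFF
    # ...but if anything in between replaces the masked text with a canonically
    # equivalent (possibly different-length) sequence, the mask cannot be undone.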





RE: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

2016-01-31 Thread Shawn Steele
Typically XOR’ing a constant isn’t really considered worth messing with.  It’s 
somewhat trivial to figure out the key to un-XOR.

On Sat, Jan 30, 2016, 6:31 PM J Decker 
<d3c...@gmail.com> wrote:
On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele
<shawn.ste...@microsoft.com> wrote:
> Why do you need illegal unicode code points?

This originated from learning Javascript; which is internally UTF-16.
Playing with localStorage, some browsers use a sqlite3 database to
store values.  The database is UTF-8 so there must be a valid
conversion between the internal UTF-16 and UTF-8 localStorage (and
reverse).  I wanted to obfuscate the data stored for a certain
application; and cover all content that someone might send.  Having
slept on this, I realized that even if hieroglyphics were stored, if I
pulled out the character using codePointAt() and applied a 20 bit
random value to it using XOR it could end up as a normal character,
and I wouldn't know I had to use a 20 bit value... so every character
would have to use a 20 bit mask (which could end up with a value
that's D800-DFFF).

I've reconsidered and think for ease of implementation to just mask
every UTF-16 character (not  codepoint) with a 10 bit value, This will
result in no character changing from BMP space to surrogate-pair or
vice-versa.

Thanks for the feedback.
(sorry if I've used some terms inaccurately)

>
> -Original Message-
> From: Unicode 
> [mailto:unicode-boun...@unicode.org] On 
> Behalf Of J Decker
> Sent: Saturday, January 30, 2016 6:40 AM
> To: unicode@unicode.org
> Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
>
> I do see that the code points D800-DFFF should not be encoded in any UTF 
> format (UTF8/32)...
>
> UTF8 has a way to define any byte that might otherwise be used as an encoding 
> byte.
>
> UTF16 has no way to define a code point that is D800-DFFF; this is an issue 
> if I want to apply some sort of encryption algorithm and still have the 
> result treated as text for transmission and encoding to other string systems.
>
> http://www.azillionmonkeys.com/qed/unicode.html   lists Unicode
> private areas Area-A which is U-F0000:U-FFFFD and Area-B which is 
> U-100000:U-10FFFD which will suffice for a workaround for my purposes
>
> For my purposes I will implement F0000-F0800 to be (code point minus
> D800 and then add F0000 (or vice versa)) and then encoded as a surrogate 
> pair... it would have been super nice if unicode standards included a way to 
> specify code point even if there isn't a language character assigned to that 
> point.
>
> http://unicode.org/faq/utf_bom.html
> does say: "Q: Are there any 16-bit values that are invalid?
>
> A: Unpaired surrogates are invalid in UTFs. These include any value in the 
> range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any 
> value in the range DC00 to DFFF not preceded by a value in the range D800 to 
> DBFF "
>
> and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
>
> A different issue arises if an unpaired surrogate is encountered when 
> converting ill-formed UTF-16 data. By represented such an unpaired surrogate 
> on its own as a 3-byte sequence, the resulting UTF-8 data stream would become 
> ill-formed. While it faithfully reflects the nature of the input, Unicode 
> conformance requires that encoding form conversion always results in valid 
> data stream. Therefore a converter must treat this as an error. "
>
>
>
> I did see these older messages... (not that they talk about this much just 
> more info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html


RE: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

2016-01-30 Thread Shawn Steele
Why do you need illegal unicode code points?

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of J Decker
Sent: Saturday, January 30, 2016 6:40 AM
To: unicode@unicode.org
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

I do see that the code points D800-DFFF should not be encoded in any UTF format 
(UTF8/32)...

UTF8 has a way to define any byte that might otherwise be used as an encoding 
byte.

UTF16 has no way to define a code point that is D800-DFFF; this is an issue if 
I want to apply some sort of encryption algorithm and still have the result 
treated as text for transmission and encoding to other string systems.

http://www.azillionmonkeys.com/qed/unicode.html   lists Unicode
private areas Area-A which is U-F0000:U-FFFFD and Area-B which is 
U-100000:U-10FFFD which will suffice for a workaround for my purposes

For my purposes I will implement F0000-F0800 to be (code point minus
D800 and then add F0000 (or vice versa)) and then encoded as a surrogate 
pair... it would have been super nice if unicode standards included a way to 
specify code point even if there isn't a language character assigned to that 
point.
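A sketch of the plane-15 workaround described above (one reading of the scheme, not the poster's code): shift the surrogate range D800..DFFF up into F0000..F07FF, which is representable in every UTF, and shift it back afterwards:

    def escape_lone_surrogate(cp):
        # D800..DFFF -> F0000..F07FF (Supplementary Private Use Area-A)
        return cp - 0xD800 + 0xF0000 if 0xD800 <= cp <= 0xDFFF else cp

    def unescape_lone_surrogate(cp):
        return cp - 0xF0000 + 0xD800 if 0xF0000 <= cp <= 0xF07FF else cp

    assert unescape_lone_surrogate(escape_lone_surrogate(0xD800)) == 0xD800
    # The escaped value encodes as an ordinary 4-byte UTF-8 (or surrogate-pair
    # UTF-16) character, at the cost of colliding with other private uses of F0000..F07FF.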

http://unicode.org/faq/utf_bom.html
does say: "Q: Are there any 16-bit values that are invalid?

A: Unpaired surrogates are invalid in UTFs. These include any value in the 
range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any 
value in the range DC00 to DFFF not preceded by a value in the range D800 to 
DBFF "

and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?

A different issue arises if an unpaired surrogate is encountered when 
converting ill-formed UTF-16 data. By represented such an unpaired surrogate on 
its own as a 3-byte sequence, the resulting UTF-8 data stream would become 
ill-formed. While it faithfully reflects the nature of the input, Unicode 
conformance requires that encoding form conversion always results in valid data 
stream. Therefore a converter must treat this as an error. "



I did see these older messages... (not that they talk about this much just more 
info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html



RE: Unicode in the Curriculum?

2016-01-06 Thread Shawn Steele
Then it should be UTF-8.  Learning to do something in a non-Unicode code page 
and then redoing it for UTF-8 or UTF-16 merely leads to conversion problems, 
incompatibilities, and other nonsense.

If someone “needs” to not use UTF-16 for whatever reason, then they should use 
UTF-8.  The “advanced” training should be the other non-Unicode code pages.

Teach them right the first time.  They’ll never use a code page.

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag 
(t)
Sent: January 6, 2016 3:19 PM
To: unicode@unicode.org
Subject: Re: Unicode in the Curriculum?

On 1/6/2016 10:59 AM, Shawn Steele wrote:

+1  :)

I'm not going to join the happy chorus here.

The "bunny" slope for most people is their own native language...

A./






-Original Message-

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Ken Whistler

Sent: Wednesday, January 6, 2016 7:44 AM

To: Andre Schappo <a.scha...@lboro.ac.uk>

Cc: unicode@unicode.org

Subject: Re: Unicode in the Curriculum?



Actually, ASCII should *not* be ignored or deprecated.



We *love* ASCII. The issue is just making sure that students understand that 
the *true name* of "ASCII" is "UTF-8". It is just the very first 128 values 
that open into the entire world of Unicode characters.



It is a mind trick to play on young programmers: when you learn "ASCII", you 
are just playing on the bunny slope at the UTF-8 ski resort. Slap on your 
snowboard and practice -- get out there onto the 2-, 3- and 4-byte slopes with 
the experts!



--Ken



On 1/6/2016 4:09 AM, Andre Schappo wrote:

On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote:



ASCII shouldn't be taught, perhaps?

I really like the idea of questioning whether or not ASCII should even be 
taught.



Wherever in a programming curriculum, text 
processing/transmission/storage/presentation/encoding is taught, then it should 
be Unicode text.



ASCII, along with, ISO-8859 ISO-2022 GB2312  .etc. should be consigned

to



.and finally, the legacy character sets/encodings...



Maybe ASCII should now be flagged as deprecated

https://twitter.com/andreschappo/status/684706421712228352



André Schappo

















RE: Unicode in the Curriculum?

2016-01-06 Thread Shawn Steele
>  I think any training in non-Unicode character sets is beyond a standard 
curriculum, except perhaps History of Computing or Digital Archaeology  :)

One could only hope.


RE: Unicode in the Curriculum?

2016-01-06 Thread Shawn Steele
+1  :)  

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Ken Whistler
Sent: Wednesday, January 6, 2016 7:44 AM
To: Andre Schappo 
Cc: unicode@unicode.org
Subject: Re: Unicode in the Curriculum?

Actually, ASCII should *not* be ignored or deprecated.

We *love* ASCII. The issue is just making sure that students understand that 
the *true name* of "ASCII" is "UTF-8". It is just the very first 128 values 
that open into the entire world of Unicode characters.

It is a mind trick to play on young programmers: when you learn "ASCII", you 
are just playing on the bunny slope at the UTF-8 ski resort. Slap on your 
snowboard and practice -- get out there onto the 2-, 3- and 4-byte slopes with 
the experts!

--Ken
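Ken's point, reduced to two lines of Python (an illustration, not from his mail): the ASCII repertoire is bit-for-bit the single-byte slope of UTF-8:

    text = "Hello, world"
    assert text.encode('ascii') == text.encode('utf-8')   # identical bytes for the first 128 characters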

On 1/6/2016 4:09 AM, Andre Schappo wrote:
> On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote:
>
>> ASCII shouldn't be taught, perhaps?
> I really like the idea of questioning whether or not ASCII should even be 
> taught.
>
> Wherever in a programming curriculum, text 
> processing/transmission/storage/presentation/encoding is taught, then it 
> should be Unicode text.
>
> ASCII, along with, ISO-8859 ISO-2022 GB2312  .etc. should be consigned 
> to
>
> .and finally, the legacy character sets/encodings...
>
> Maybe ASCII should now be flagged as deprecated 
> https://twitter.com/andreschappo/status/684706421712228352
>
> André Schappo
>
>
>
>




RE: crafting emoji

2015-10-24 Thread Shawn Steele
Seeing the title first I read “crafting” as a verb and thought you wanted to 
knit some ☺

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Molly Black
Sent: Friday, October 23, 2015 7:34 PM
To: unicode@unicode.org
Subject: crafting emoji

Why is there no knitting needles, yarn, sewing needle with thread or sewing 
machine in the emoji library?  Has this been discussed already?


RE: [somewhat off topic] straw poll

2015-09-10 Thread Shawn Steele
Q1 I ignore threads that aren’t of interest (outlook even has a handy “ignore 
thread” button - though lists like this tend to break it)
Q2 If they get too annoying and don’t have useful content, then I make a rule 
to send that person’s mail to the trashcan. I include their name in the body to 
catch replies as well.
Q3 If there were too many of those folks, then I’d have more rules.

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Frédéric 
Grosshans
Sent: Thursday, September 10, 2015 11:21 AM
To: Peter Constable ; Unicode Mailing List 

Subject: Re: [somewhat off topic] straw poll


Q1: neutral
Q2: annoying
Q3: reducing value of the list for me

On Thu, 10 Sep 2015 at 20:10, Peter Constable wrote:
I was having an offline discussion with someone regarding certain topics that 
may show up on this list on occasion, and the question came up of what evidence 
we might have of sentiment on the list. So, I thought I’d conduct a simple 
straw poll — respond if you feel inclined.

The questions are framed around this hypothetical scenario: Suppose I were to 
post a message to the list describing some experiment I did, creating a Web 
page containing (say) some Latin characters — not obscure, 
just-added-in-Unicode-8 characters, but ones that have been in the standard for 
some time; that my process for creating the file was to use (say) Notepad and 
entering HTML numeric character references; and that my findings were that it 
worked.

Q1: Would you find that to be an interesting post that makes your 
participation in the list more useful, or would you find it a noisy distraction 
that reduces the value you get from participating in the list?

Q2: If I were to send messages along that line on a regular basis, would that 
add value to your participation in the list, or reduce it?

Q3: If 50 people (still a small portion of the list membership) were to send 
messages along that line on a regular basis, would that add value to your 
participation in the list, or reduce it?



Peter



RE: Dark beer emoji

2015-09-03 Thread Shawn Steele
If we have a bunch of ingredients emoji, then do yeast + grain + hops emoji 
combine into beer emoji?




RE: Dark beer emoji

2015-09-01 Thread Shawn Steele
Thanks

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Michael Everson
Sent: Tuesday, September 1, 2015 10:40 AM
To: Unicode Mailing List <unicode@unicode.org>
Subject: Re: Dark beer emoji

On 1 Sep 2015, at 18:29, Shawn Steele <shawn.ste...@microsoft.com> wrote:
> 
> Ugh, should've encoded that Martian green skin-tone.  Then we'd've been 
> prepared for St. Patty's Day beers.

Recte: St. Paddy’s Day

Michael Everson * http://www.evertype.com/





RE: Dark beer emoji

2015-09-01 Thread Shawn Steele
Ugh, should've encoded that Martian green skin-tone.  Then we'd've been 
prepared for St. Patty's Day beers.

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Doug Ewell
Sent: Tuesday, September 1, 2015 9:37 AM
To: Unicode Mailing List 
Subject: Dark beer emoji

Document L2/15-211, "Letter in support of dark beer emoji"
, is a request 
submitted by Cuauhtémoc Moctezuma, a Mexican brewery.

The letter refers to a petition with more than 22,000 signatures supporting 
such an emoji, and may have at least some commercial motivation ("We want the 
dark beer to be part of peoples conversations").

As an alternative to this proposal that may provide more flexibility, I propose 
adapting the Fitzpatrick skin-tone modifiers from U+1F3FB to
U+1F3FF to be valid for use following U+1F37A BEER MUG or U+1F37B
CLINKING BEER MUGS.

This could be done by establishing a normative correlation between the 
Fitzpatrick scale and the Standard Reference Method (SRM), Lovibond, and/or 
European Brewery Convention (EBC) beer color scales 
.

This mechanism would allow the entire spectrum of beer styles to be depicted, 
instead of dividing beers arbitrarily into "light" and "dark,"
in the same way (and for the same reason) that Unicode already supports a 
variety of skin tones.

For example, a Budweiser or similar lager could be represented as
 <1F37A, 1F3FB>, while a Newcastle Brown Ale might be 
<1F37A, 1F3FD>. U+1F3FF could denote imperial stout or Baltic porter.
There might be a need to encode an additional "Type 0" color modifier to extend 
the "light" end of the scale, such as for non-alcoholic brews, or for Coors 
Light.

U+1F37B could be used to denote two beers of the same style, but for
beers of different colors, the mechanism described in UTR #51, Section
2.2.1 ("Multi-Person Groupings"), involving ZWJ, could be utilized. So a toast 
between drinkers of the two beers above could be encoded as
‍ <1F37A, 1F3FB, 200D, 1F37A, 1F3FD>. Longer sequences would also be 
possible, such as for beer samplers offered in some pubs and restaurants.
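For concreteness, the "toast" sequence described above, spelled out in Python (a sketch of the tongue-in-cheek proposal; the lager/ale mapping is the joke scale above, not anything standardized):

    light = "\U0001F37A\U0001F3FB"   # BEER MUG + EMOJI MODIFIER FITZPATRICK TYPE-1-2 (the pale lager)
    dark  = "\U0001F37A\U0001F3FD"   # BEER MUG + EMOJI MODIFIER FITZPATRICK TYPE-4 (the brown ale)
    toast = light + "\u200D" + dark  # joined with ZERO WIDTH JOINER, as in the UTR #51 multi-person groupings
    print(" ".join(f"{ord(c):04X}" for c in toast))   # 1F37A 1F3FB 200D 1F37A 1F3FD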

I have no idea whether my proposal is more or less serious, or more or less 
likely to be adopted, than the original.

--
Doug Ewell | http://ewellic.org | Thornton, CO 





RE: Dark beer emoji

2015-09-01 Thread Shawn Steele
It's my birthday, so I knew it wasn't April. :)

It'd be a fun font easter egg though...

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Karl Williamson
Sent: Tuesday, September 1, 2015 11:37 AM
To: Doug Ewell ; Unicode Mailing List 
Subject: Re: Dark beer emoji

On 09/01/2015 10:37 AM, Doug Ewell wrote:
> I have no idea whether my proposal is more or less serious, or more or 
> less likely to be adopted, than the original.

When I read this, I wondered if it was April 1 instead of September 1.



RE: Dark beer emoji

2015-09-01 Thread Shawn Steele
In one version the beer is inside the glass, in the other, the beer is outside 
the glass.

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard 
Wordingham
Sent: Tuesday, September 1, 2015 1:43 PM
To: Unicode Mailing List 
Subject: Re: Dark beer emoji

On Tue, 01 Sep 2015 11:13:13 -0700
"Doug Ewell"  wrote:

> Asmus Freytag (t)  wrote:
> 
> > Well, you didn't consider that each style of beer may be served in a 
> > different style glass. :)
> 
> Yay, emoji modifier chaining:
> 
> U+1F37A BEER MUG
> U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2
> U+1Fxxx EMOJI MODIFIER WEIZEN GLASS

How is that to be equated to ?  Or is some rendering 
difference to be expected?

Richard.



RE: a suggestion new emoji .

2015-08-18 Thread Shawn Steele
I'm sure Klingons love them!

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Mark E. Shoulson
Sent: Tuesday, August 18, 2015 5:53 PM
To: unicode@unicode.org
Subject: Re: a suggestion new emoji .

On 08/18/2015 07:20 PM, Emma Haneys wrote:
 hello dear unicode , i just wondering if i can suggest a new emoji .
 hoppefully you can respone to me . i suggest one and only for fruit 
 category . it is a durian .  thanx
Ah, durians.  Kind of a cross between food and weaponry.

~mark



RE: Revenge of pIqaD

2015-07-28 Thread Shawn Steele
You missed Bing translate?  
http://www.bing.com/translator/?from=en&to=tlh-Qaak&text=Success

- Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Mark Shoulson
Sent: Tuesday, July 28, 2015 7:21 PM
To: unicode@unicode.org; Chris Lipscombe qu...@wizage.net
Subject: Revenge of pIqaD

OK!  I'm freshly back from the qep'a' cha'maH cha'DIch in Chicago, and I have 
to report that Klingon pIqaD really is out there and getting some use, despite 
having been banished to the PUA.  I've seen it on a wine-bottle label 
(commercially produced, not someone's homebrew), on the Klingon version of the 
Monopoly game, a book or two (NOT published by the KLI); there are websites 
using it (but then there were last time I mentioned this and that didn't seem 
to count then), and apparently support for it on several platforms, including a 
smartphone keypad, to say nothing of quite a few T-shirts.  Apparently there is 
a small community actually using pIqaD to (*gasp*) exchange information via 
SMS.  I'm copying Chris Lipscombe on this email; he is better plugged in to the 
use of pIqaD in Real Life™ (don't forget to Reply All if you want to include 
him, since I think he isn't on the list at the moment).

What has to be done to get this encoded?  The proposal is likely still more or 
less what we need, and it probably has at least as much online information 
interchange as, say, Gondi does (Well, what do you expect, Gondi isn't encoded 
yet! Neither is pIqaD.)  Are we ready to revisit this question again?

~mark


RE: Revenge of pIqaD

2015-07-28 Thread Shawn Steele
Ooo, I forgot that means everything is in pIqaD!  
http://www.microsofttranslator.com/bv.aspx?from=en&to=tlh-Qaak&a=http%3A%2F%2Fwww.cnn.com%2F

From: Shawn Steele
Sent: Tuesday, July 28, 2015 7:50 PM
To: 'Mark Shoulson' m...@kli.org; unicode@unicode.org; Chris Lipscombe 
qu...@wizage.net
Subject: RE: Revenge of pIqaD

You missed Bing translate?  
http://www.bing.com/translator/?from=en&to=tlh-Qaak&text=Success

- Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Mark Shoulson
Sent: Tuesday, July 28, 2015 7:21 PM
To: unicode@unicode.org; Chris Lipscombe 
qu...@wizage.net
Subject: Revenge of pIqaD

OK!  I'm freshly back from the qep'a' cha'maH cha'DIch in Chicago, and I have 
to report that Klingon pIqaD really is out there and getting some use, despite 
having been banished to the PUA.  I've seen it on a wine-bottle label 
(commercially produced, not someone's homebrew), on the Klingon version of the 
Monopoly game, a book or two (NOT published by the KLI); there are websites 
using it (but then there were last time I mentioned this and that didn't seem 
to count then), and apparently support for it on several platforms, including a 
smartphone keypad, to say nothing of quite a few T-shirts.  Apparently there is 
a small community actually using pIqaD to (*gasp*) exchange information via 
SMS.  I'm copying Chris Lipscombe on this email; he is better plugged in to the 
use of pIqaD in Real Life™ (don't forget to Reply All if you want to include 
him, since I think he isn't on the list at the moment).

What has to be done to get this encoded?  The proposal is likely still more or 
less what we need, and it probably has at least as much online information 
interchange as, say, Gondi does (Well, what do you expect, Gondi isn't encoded 
yet! Neither is pIqaD.)  Are we ready to revisit this question again?

~mark


RE: Bunny hill symbol, used in America for signaling ski pistes for novices

2015-05-30 Thread Shawn Steele
I’m really curious to see one of these signs.  Is it a regional thing?

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Leonardo Boiko
Sent: Thursday, May 28, 2015 1:02 PM
To: Philippe Verdy
Cc: unicode Unicode Discussion
Subject: Re: Bunny hill symbol, used in America for signaling ski pistes for 
novices

You could use U+1F407 RABBIT combined with U+20E4 COMBINING ENCLOSING UPWARD 
POINTING TRIANGLE, and pretend the triangle is a hill.   ⃤
If only we had a combining rabbit, we could add rabbits to U+1F3D4 SNOW CAPPED 
MOUNTAIN.  Or anything else.

2015-05-28 16:46 GMT-03:00 Philippe Verdy 
verd...@wanadoo.fr:
Is there a symbol that can represent the Bunny hill symbol used in North 
America and some other American territories with mountains, to designate the 
ski pistes open to novice skiers (those pistes are signaled with green signs in 
Europe).

I'm looking for the symbol itself, not the color, or the form of the sign.

For example blue pistes in Europe are designed with a green circle in America, 
but we have a symbol for the circle; red pistes in Europe are signaled by a 
blue square in America, but we have a symbol for the square; black pistes in 
Europe are signaled by a black diamond in America, but we also have such 
black diamond in Unicode.

But I can't find an equivalent to the American Bunny hill signal, equivalent 
to green pistes in Europe (this is a problem for webpages related to skiing: do 
we have to embed an image ?).




RE: Re: Bunny hill symbol, used in America for signaling ski pistes for novices

2015-05-30 Thread Shawn Steele
I guess it depends on what you’re representing.  If it is the concept of 
“double black”, then maybe a separate symbol and the “font” or other selectors 
determine if it’s vertically or horizontally rendered.

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy
Sent: Saturday, May 30, 2015 2:56 PM
To: Jörg Knappen
Cc: Shervin Afshar; unicode Unicode Discussion
Subject: Re: Re: Bunny hill symbol, used in America for signaling ski pistes 
for novices

But observations show that the vertical stacking is not universal. Horizontal 
stacking is also used in direction signs. My opinion is that they are just two 
separate diamonds and not a single symbol.

Quite equivalent to the situation with the classification of hotels with stars 
(generally aligned horizontally but not always, we can see them also arranged 
vertically, or on two rows 1+1, 1+2 or 2+1 or 2+3 or 3+2...)

I don't think the exact layout of individual symbols (diamond, star, ...) is 
semantically significant, only their number is important  (and the fact they 
are grouped together on the same medium with the same foreground/background 
colors or texturing and the same sizes).

2015-05-29 9:32 GMT+02:00 Jörg Knappen 
jknap...@web.de:
From the description of the symbol it looks like a geometric shape. I think it 
is worth to be encoded as a geometric shape (TWO BLACK DIAMONDS VERTICALLY 
STACKED or something like this) with a note * bunny hill. It may have (or find 
in future) other uses.

--Jörg Knappen

Sent: Thursday, 28 May 2015 at 23:20
From: Shervin Afshar shervinafs...@gmail.com
To: Shawn Steele shawn.ste...@microsoft.com
Cc: verd...@wanadoo.fr, unicode Unicode Discussion unicode@unicode.org, Jim 
Melton jim.mel...@oracle.com
Subject: Re: Bunny hill symbol, used in America for signaling ski pistes for 
novices
Since the double-diamond has map and map legend usage, it might be a good idea 
to have it encoded separately. I know that I'm stating the obvious here, but 
the important point is doing the research and showing that it has widespread 
usage.

↪ Shervin

On Thu, May 28, 2015 at 2:15 PM, Shawn Steele 
shawn.ste...@microsoft.com wrote:
I’m used to them being next to each other.  So the entire discussion seems to 
be about how to encode a concept vs how to get the shape you want with existing 
code points.   If you just want the perfect shape, then maybe an svg is a 
better choice.  If we’re talking about describing ski-run difficulty levels in 
plain-text, then the hodge-podge of glyphs being offered in this thread seems 
kinda hacky to me.

-Shawn

From: ver...@gmail.com 
[mailto:ver...@gmail.com] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 2:12 PM
To: Jim Melton
Cc: Shawn Steele; unicode Unicode Discussion
Subject: Re: Bunny hill symbol, used in America for signaling ski pistes for 
novices

Some documentations also suggest that the two diamonds are not stacked one 
above the other, but horizontally. It's a good point for using only one symbol, 
encoding it twice in plain-text if needed.

2015-05-28 22:15 GMT+02:00 Jim Melton 
jim.mel...@oracle.com:
I no longer ski, but I did so for many years, mostly (but not exclusively) in 
the western United States.  I never encountered, at any USA ski 
hill/mountain/resort, a special symbol for bunny hills, which are typically 
represented by the green circle meaning beginner.  That's anecdotal evidence 
at best, but my observations cover numerous skiing sites.  I have encountered 
such a symbol in Europe and in New Zealand, but not in the USA.  (I have not 
had the pleasure of skiing in Canada and am thus unable to speak about ski 
areas in that country.)

The double black diamond would appear to be a unique symbol worthy of encoding, 
simply because the only valid typographical representation (in the USA) is two 
single black diamonds stacked one above the other and touching at the points.

Hope this helps,
   Jim

On 5/28/2015 2:04 PM, Shawn Steele wrote:
So is double black diamond a separate symbol?  Or just two of the black diamond?

And Blue-Black?

I’m drawing a blank on a specific bunny sign, in my experience those are 
usually just green.

Aren’t there a lot of cartography symbols for various systems that aren’t 
present in Unicode?

From: Unicode 
[mailto:unicode-boun...@unicode.org] On 
Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 12:47 PM
To: unicode Unicode Discussion
Subject: Bunny hill symbol, used in America for signaling ski pistes for 
novices

Is there a symbol that can represent the Bunny hill symbol used in North 
America and some other American territories with mountains

RE: Bunny hill symbol, used in America for signaling ski pistes for novices

2015-05-28 Thread Shawn Steele
I’m used to them being next to each other.  So the entire discussion seems to 
be about how to encode a concept vs how to get the shape you want with existing 
code points.   If you just want the perfect shape, then maybe an svg is a 
better choice.  If we’re talking about describing ski-run difficulty levels in 
plain-text, then the hodge-podge of glyphs being offered in this thread seems 
kinda hacky to me.

-Shawn

From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 2:12 PM
To: Jim Melton
Cc: Shawn Steele; unicode Unicode Discussion
Subject: Re: Bunny hill symbol, used in America for signaling ski pistes for 
novices

Some documentations also suggest that the two diamonds are not stacked one 
above the other, but horizontally. It's a good point for using only one symbol, 
encoding it twice in plain-text if needed.

2015-05-28 22:15 GMT+02:00 Jim Melton 
jim.mel...@oracle.com:
I no longer ski, but I did so for many years, mostly (but not exclusively) in 
the western United States.  I never encountered, at any USA ski 
hill/mountain/resort, a special symbol for bunny hills, which are typically 
represented by the green circle meaning beginner.  That's anecdotal evidence 
at best, but my observations cover numerous skiing sites.  I have encountered 
such a symbol in Europe and in New Zealand, but not in the USA.  (I have not 
had the pleasure of skiing in Canada and am thus unable to speak about ski 
areas in that country.)

The double black diamond would appear to be a unique symbol worthy of encoding, 
simply because the only valid typographical representation (in the USA) is two 
single black diamonds stacked one above the other and touching at the points.

Hope this helps,
   Jim

On 5/28/2015 2:04 PM, Shawn Steele wrote:
So is double black diamond a separate symbol?  Or just two of the black diamond?

And Blue-Black?

I’m drawing a blank on a specific bunny sign, in my experience those are 
usually just green.

Aren’t there a lot of cartography symbols for various systems that aren’t 
present in Unicode?

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 12:47 PM
To: unicode Unicode Discussion
Subject: Bunny hill symbol, used in America for signaling ski pistes for 
novices

Is there a symbol that can represent the Bunny hill symbol used in North 
America and some other American territories with mountains, to designate the 
ski pistes open to novice skiers (those pistes are signaled with green signs in 
Europe).

I'm looking for the symbol itself, not the color, or the form of the sign.

For example blue pistes in Europe are designed with a green circle in America, 
but we have a symbol for the circle; red pistes in Europe are signaled by a 
blue square in America, but we have a symbol for the square; black pistes in 
Europe are signaled by a black diamond in America, but we also have such 
black diamond in Unicode.

But I can't find an equivalent to the American Bunny hill signal, equivalent 
to green pistes in Europe (this is a problem for webpages related to skiing: do 
we have to embed an image ?).



--



Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144

  Chair, ISO/IEC JTC1/SC32 and W3C XML Query WGFax : +1.801.942.3345

Oracle CorporationOracle Email: jim dot melton at oracle dot com

1930 Viscounti Drive  Alternate email: jim dot melton at acm dot org

Sandy, UT 84093-1063 USA  Personal email: SheltieJim at xmission dot com



=  Facts are facts.   But any opinions expressed are the opinions  =

=  only of myself and may or may not reflect the opinions of anybody   =

=  else with whom I may or may not have discussed the issues at hand.  =





RE: Bunny hill symbol, used in America for signaling ski pistes for novices

2015-05-28 Thread Shawn Steele
I’m wondering if it’s a regional thing, I haven’t seen it, at least in the 
mostly-west of North America.  An east coast thing?

From: Jim Melton [mailto:jim.mel...@oracle.com]
Sent: Thursday, May 28, 2015 1:16 PM
To: Shawn Steele
Cc: verd...@wanadoo.fr; unicode Unicode Discussion
Subject: Re: Bunny hill symbol, used in America for signaling ski pistes for 
novices

I no longer ski, but I did so for many years, mostly (but not exclusively) in 
the western United States.  I never encountered, at any USA ski 
hill/mountain/resort, a special symbol for bunny hills, which are typically 
represented by the green circle meaning beginner.  That's anecdotal evidence 
at best, but my observations cover numerous skiing sites.  I have encountered 
such a symbol in Europe and in New Zealand, but not in the USA.  (I have not 
had the pleasure of skiing in Canada and am thus unable to speak about ski 
areas in that country.)

The double black diamond would appear to be a unique symbol worthy of encoding, 
simply because the only valid typographical representation (in the USA) is two 
single black diamonds stacked one above the other and touching at the points.

Hope this helps,
   Jim

On 5/28/2015 2:04 PM, Shawn Steele wrote:
So is double black diamond a separate symbol?  Or just two of the black diamond?

And Blue-Black?

I’m drawing a blank on a specific bunny sign, in my experience those are 
usually just green.

Aren’t there a lot of cartography symbols for various systems that aren’t 
present in Unicode?

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 12:47 PM
To: unicode Unicode Discussion
Subject: Bunny hill symbol, used in America for signaling ski pistes for 
novices

Is there a symbol that can represent the Bunny hill symbol used in North 
America and some other American territories with mountains, to designate the 
ski pistes open to novice skiers (those pistes are signaled with green signs in 
Europe).

I'm looking for the symbol itself, not the color, or the form of the sign.

For example blue pistes in Europe are designed with a green circle in America, 
but we have a symbol for the circle; red pistes in Europe are signaled by a 
blue square in America, but we have a symbol for the square; black pistes in 
Europe are signaled by a black diamond in America, but we also have such 
black diamond in Unicode.

But I can't find an equivalent to the American Bunny hill signal, equivalent 
to green pistes in Europe (this is a problem for webpages related to skiing: do 
we have to embed an image ?).




--



Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144

  Chair, ISO/IEC JTC1/SC32 and W3C XML Query WGFax : +1.801.942.3345

Oracle CorporationOracle Email: jim dot melton at oracle dot com

1930 Viscounti Drive  Alternate email: jim dot melton at acm dot org

Sandy, UT 84093-1063 USA  Personal email: SheltieJim at xmission dot com



=  Facts are facts.   But any opinions expressed are the opinions  =

=  only of myself and may or may not reflect the opinions of anybody   =

=  else with whom I may or may not have discussed the issues at hand.  =




RE: Bunny hill symbol, used in America for signaling ski pistes for novices

2015-05-28 Thread Shawn Steele
What is the image?, curiosity killed the bunny ☺  I expect that it’s limited to 
a single ski area or maybe region.

From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 3:01 PM
To: Shawn Steele
Cc: Doug Ewell; Unicode Mailing List
Subject: Re: Bunny hill symbol, used in America for signaling ski pistes for 
novices

The rope (or other barriers) are also present in Europe, but they are 
considered true pistes by themselves, even if they are relatively short. In 
frequent cases they are connected upward to a blue piste (not for novices) but 
there are slow down warnings displayed on them and the regulation requires 
taking care of every skier that could be in front of you.

Various tools are used to force skiers to slow down, including forcing them to 
slalom between barriers, or including flat sections or sections going upward, 
and adding a large rest area around this interconnection.

The European green pistes for novices are also relatively well separated from 
blue pistes (used by all other skiers and interconnected with mor difficult 
ones: red and black): if there's a blue piste, it will most often be parallel 
and separated physically by barriers, this limits the number of intersections 
or the need for interconnections (the only intersection is then at the station 
itself, in a crowded area near the equipments to bring skiers to the upper part 
of the piste).

But my initial question was about the symbol that I have seen (partly) 
documented without an actual image for ski stations in US. May be the bunny 
hills symbol is specific to a station, not used elsewhere, or there are other 
similar symbols used locally. I wonder if this is not simply the symbol/logo of 
a local ski school...

2015-05-28 23:44 GMT+02:00 Shawn Steele 
shawn.ste...@microsoft.com:
Typically we have “slow” zones which include both “novice” areas and congested 
areas.  Additionally the “novice” part of a slope often has a rope fence 
delineating it from the rest of the slope.  However on the maps, etc., it’s 
usually just off to the side of a green run and doesn’t have a special symbol.

From: Unicode 
[mailto:unicode-boun...@unicode.org] On 
Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 2:26 PM
To: Doug Ewell
Cc: Unicode Mailing List
Subject: Re: Bunny hill symbol, used in America for signaling ski pistes for 
novices

2015-05-28 22:59 GMT+02:00 Doug Ewell 
d...@ewellic.org:
Looks like a green circle is the symbol for a beginner slope. (The first
link also shows that piste is the European word for what we call a
trail, run, or slope). There is no difference between a bunny slope
and a beginner or novice slope.

The difference is obvious in Europe where the novice difficulty is marked as 
green pistes (slopes are below 30% or almost flat), and the beginner/moderate 
difficulty is marked as blue pistes (slopes about 30-35%).

Even America must have this novice difficulty, with areas mostly used by 
young children (with their parents not skiing but following them by foot, and a 
restriction of speeds); these areas are protected so that other skiers will not 
pass through them. In fact if you remain on these novice areas you cannot reach 
any speed that could cause dangerous shocks: you have to push to advance, 
otherwise you'll slow down naturally and stop on the snow.

These areas can be used by walkers, and by hikers (randonneurs) on snowshoes (raquettes).





RE: Bunny hill symbol, used in America for signaling ski pistes for novices

2015-05-28 Thread Shawn Steele
So is double black diamond a separate symbol?  Or just two of the black diamond?

And Blue-Black?

I’m drawing a blank on a specific bunny sign, in my experience those are 
usually just green.

Aren’t there a lot of cartography symbols for various systems that aren’t 
present in Unicode?

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 12:47 PM
To: unicode Unicode Discussion
Subject: Bunny hill symbol, used in America for signaling ski pistes for 
novices

Is there a symbol that can represent the Bunny hill symbol used in North 
America and some other American territories with mountains, to designate the 
ski pistes open to novice skiers (those pistes are signaled with green signs in 
Europe).

I'm looking for the symbol itself, not the color, or the form of the sign.

For example, blue pistes in Europe are designated by a green circle in America, 
but we have a symbol for the circle; red pistes in Europe are signaled by a 
blue square in America, but we have a symbol for the square; black pistes in 
Europe are signaled by a black diamond in America, but we also have such a 
black diamond in Unicode.

But I can't find an equivalent to the American Bunny hill signal, equivalent 
to green pistes in Europe (this is a problem for webpages related to skiing: do 
we have to embed an image?).



RE: Bunny hill symbol, used in America for signaling ski pistes for novices

2015-05-28 Thread Shawn Steele
Typically we have “slow” zones which include both “novice” areas and congested 
areas.  Additionally the “novice” part of a slope often has a rope fence 
delineating it from the rest of the slope.  However on the maps, etc., it’s 
usually just off to the side of a green run and doesn’t have a special symbol.

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 2:26 PM
To: Doug Ewell
Cc: Unicode Mailing List
Subject: Re: Bunny hill symbol, used in America for signaling ski pistes for 
novices

2015-05-28 22:59 GMT+02:00 Doug Ewell d...@ewellic.org:
Looks like a green circle is the symbol for a beginner slope. (The first
link also shows that piste is the European word for what we call a
trail, run, or slope). There is no difference between a bunny slope
and a beginner or novice slope.

The difference is obvious in Europe where the novice difficulty is marked as 
green pistes (slopes are below 30% or almost flat), and the beginner/moderate 
difficulty is marked as blue pistes (slopes about 30-35%).

Even America must have this novice difficulty, with areas mostly used by 
young children (with their parents not skiing but following them on foot, and 
speed restrictions); these areas are protected so that other skiers will not 
pass through them. In fact if you remain on these novice areas you cannot reach 
any speed that could cause dangerous collisions: you have to push to advance, 
otherwise you'll slow down naturally and stop on the snow.

These areas can also be used by walkers and by hikers on snowshoes (raquettes).




RE: Tag characters

2015-05-20 Thread Shawn Steele
I've always been a bit partial to them and found it odd that they are 
intentionally not included in Unicode.  Especially the novel concepts like the 
repeats.

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard 
Wordingham
Sent: Wednesday, May 20, 2015 6:08 PM
To: unicode@unicode.org
Subject: Re: Tag characters

On Wed, 20 May 2015 17:15:28 -0700
Asmus Freytag (t) asmus-...@ix.netcom.com wrote:

 Have there been any discussions of the flag alphabet? (Signal flags).

 It seems to me that when schemes for representing sets of flags are 
 discussed, it would be useful to keep open the ability to use the same 
 scheme for signal flags -- perhaps with a different base character to 
 avoid collisions in the letter codes.

If these are worthy of coding, I think the Unified Canadian Aboriginal 
Syllabics would be a better model - encode the form, not the semantic.
Braille is another precedent.

Richard.



RE: Characters that should be displayed?

2014-06-29 Thread Shawn Steele
If the concern is security, I cannot imagine why CSS would even want something 
like BELL to be legal at all.  

I'm not sure that replacement glyphs would help much.  I mean would someone 
thing that �Shawn was something spoofing Shawn, or just assume their 
browser/computer had a rendering glitch?  I think most people would just ignore 
the unexpected character and assume something was quirky with the web page.

-Shawn

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Koji Ishii
Sent: Sunday, June 29, 2014 11:44 AM
To: Unicode Mailing List
Subject: Characters that should be displayed?

Hello Unicoders,

I’m a co-editor of CSS Text Level 3[1], and I would appreciate your support in 
defining rendering behavior in CSS.

The spec currently has the following text[2]:

 Control characters (Unicode class Cc) other than tab (U+0009), line feed 
 (U+000A), and carriage return (U+000D) are ignored for the purpose of 
 rendering. (As required by [UNICODE], unsupported Default_ignorable 
 characters must also be ignored for rendering.)

and there’s feedback saying that CSS should display visible glyphs for these 
control characters. Since no major browser displays them today, this is 
a breaking change and the CSS WG needs to discuss this feedback. But the WG 
would appreciate understanding what Unicode recommends.

I found the following text in Unicode 6.3, p. 185, “5.21 Ignoring Characters in 
Processing”[3]:

 Surrogate code points, private-use characters, and control characters are not 
 given the Default_Ignorable_Code_Point property. To avoid security problems, 
 such characters or code points, when not interpreted and not displayable by 
 normal rendering, should be displayed in fallback rendering with a fallback 
 glyph

By looking at this, my questions are as follows:

1. Should control characters that browsers do not interpret be displayed in 
fallback rendering?
2. Should private-use characters (U+E000-F8FF, F0000-FFFFD, 100000-10FFFD) 
without glyphs be displayed in fallback rendering?

These two questions are probably yes from what I understand the text quoted 
above, but things get harder the more I think:

3. When the above text says “surrogate code points”, does that mean everything 
outside BMP? It reads so to me, but I’m surprised that characters in BMP and 
outside BMP have such differences, so I’m doubting my English skill.
4. Should every code point that is not given the Default_Ignorable_Code_Point 
property and that has no interpretation or glyph be displayed in fallback 
rendering? I could not find such a statement in the Unicode spec, but there are 
some people who believe so.
5. Is there anything else Unicode recommends to display in fallback rendering, 
or not to display? This must be RTFM, but pointing out where to read would be 
appreciated.

Thank you for your support in advance.

[1] http://dev.w3.org/csswg/css-text/
[2] http://dev.w3.org/csswg/css-text/#white-space-processing
[3] http://www.unicode.org/versions/Unicode6.3.0/ch05.pdf

/koji


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode



RE: Characters that should be displayed?

2014-06-29 Thread Shawn Steele
Corrected typo, sorry. (someone thing/someone think)

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Shawn Steele
Sent: Sunday, June 29, 2014 11:59 AM
To: Koji Ishii; Unicode Mailing List
Subject: RE: Characters that should be displayed?

If the concern is security, I cannot imagine why CSS would even want something 
like BELL to be legal at all.  

I'm not sure that replacement glyphs would help much.  I mean would someone 
think that �Shawn was something spoofing Shawn, or just assume their 
browser/computer had a rendering glitch?  I think most people would just ignore 
the unexpected character and assume something was quirky with the web page.

-Shawn

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Koji Ishii
Sent: Sunday, June 29, 2014 11:44 AM
To: Unicode Mailing List
Subject: Characters that should be displayed?

Hello Unicoders,

I’m a co-editor of CSS Text Level 3[1], and I would appreciate your support in 
defining rendering behavior in CSS.

The spec currently has the following text[2]:

 Control characters (Unicode class Cc) other than tab (U+0009), line feed 
 (U+000A), and carriage return (U+000D) are ignored for the purpose of 
 rendering. (As required by [UNICODE], unsupported Default_ignorable 
 characters must also be ignored for rendering.)

and there’s feedback saying that CSS should display visible glyphs for these 
control characters. Since no major browser displays them today, this is 
a breaking change and the CSS WG needs to discuss this feedback. But the WG 
would appreciate understanding what Unicode recommends.

I found the following text in Unicode 6.3, p. 185, “5.21 Ignoring Characters in 
Processing”[3]:

 Surrogate code points, private-use characters, and control characters are not 
 given the Default_Ignorable_Code_Point property. To avoid security problems, 
 such characters or code points, when not interpreted and not displayable by 
 normal rendering, should be displayed in fallback rendering with a fallback 
 glyph

By looking at this, my questions are as follows:

1. Should control characters that browsers do not interpret be displayed in 
fallback rendering?
2. Should private-use characters (U+E000-F8FF, F0000-FFFFD, 100000-10FFFD) 
without glyphs be displayed in fallback rendering?

These two questions are probably yes from what I understand the text quoted 
above, but things get harder the more I think:

3. When the above text says “surrogate code points”, does that mean everything 
outside BMP? It reads so to me, but I’m surprised that characters in BMP and 
outside BMP have such differences, so I’m doubting my English skill.
4. Should every code point that is not given the Default_Ignorable_Code_Point 
property and that has no interpretation or glyph be displayed in fallback 
rendering? I could not find such a statement in the Unicode spec, but there are 
some people who believe so.
5. Is there anything else Unicode recommends to display in fallback rendering, 
or not to display? This must be RTFM, but pointing out where to read would be 
appreciated.

Thank you for your support in advance.

[1] http://dev.w3.org/csswg/css-text/
[2] http://dev.w3.org/csswg/css-text/#white-space-processing
[3] http://www.unicode.org/versions/Unicode6.3.0/ch05.pdf

/koji


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode



RE: Corrigendum #9

2014-06-08 Thread Shawn Steele
 I should note that this front-end to 'diff' changes the input files, writes 
 the modified versions out, and calls 'diff' with those modified files as its 
 inputs.  By using noncharacters, it would be depending on 'diff' to 1) not 
 use them, and 2) to not filter them out, and 3) for the system to be able to 
 store and retrieve them in files.

In my view that is still internal to your apps use of these characters :)

The original text doesn't say that my application cannot store & retrieve them 
from files for internal use.  On the contrary, I'd expect proprietary formats 
for internal use to require that.  I agree that the original text is a bit 
vague on the question of tools to inspect/modify/whatever your internal use.

-Shawn

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Swift

2014-06-04 Thread Shawn Steele
I’m sort of confused why Unicode would be a big deal.  C# & other languages 
have allowed Unicode letters in identifiers for years, so readable strings 
should be possible in almost any language.

It’s a bit cute to include emoji, but I’m not sure how practical it is.  It 
also makes me wonder how they came up with the list; I presume control codes 
aren’t allowed?  Or alternate whitespace?  I assume they use some Unicode 
categories to figure out the permitted set?
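
Purely as an illustration of the category idea, here is a short Python sketch; 
the category choices are my own rough guess at a UAX #31-style rule, not 
Apple's actual definition:

    import unicodedata

    # Rough sketch: letters may start an identifier; letters, marks, digits,
    # and connector punctuation may continue one.
    START_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
    CONTINUE_CATS = START_CATS | {"Mn", "Mc", "Nd", "Pc"}

    def is_identifier(name):
        if not name:
            return False
        if unicodedata.category(name[0]) not in START_CATS:
            return False
        return all(unicodedata.category(c) in CONTINUE_CATS for c in name[1:])

    print(is_identifier("π"))       # True
    print(is_identifier("你好"))     # True
    print(is_identifier("9lives"))  # False (starts with a digit)

Note that Swift also allows emoji, which this sketch would reject, so it really 
is only the shape of the idea.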

I rarely see non-Latin code in practice though, but of course I’m a native 
English speaker.

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Mark Davis ??
Sent: Wednesday, June 4, 2014 2:41 AM
To: Andre Schappo
Cc: unicode@unicode.org
Subject: Re: Swift

Apparently you can use emoji in the identifiers. 

(http://www.globalnerdy.com/2014/06/03/swift-fun-fact-1-you-can-use-emoji-characters-in-variable-constant-function-and-class-names/)


Mark <https://google.com/+MarkDavis>

— Il meglio è l’inimico del bene —

On Wed, Jun 4, 2014 at 11:28 AM, Andre Schappo a.scha...@lboro.ac.uk wrote:
Swift is Apple's new programming language. In Swift, variable and constant 
names can be constructed from Unicode characters. Here are a couple of examples 
from Apple's doc 
http://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html

let π = 3.14159
let 你好 = 你好世界

I think this a huge step forward for i18n and Unicode.

There are some restrictions on which Unicode chars can be used. From Apple's doc

Constant and variable names cannot contain mathematical symbols, arrows, 
private-use (or invalid) Unicode code points, or line- and box-drawing 
characters. Nor can they begin with a number, although numbers may be included 
elsewhere within the name.

The restrictions seem a little like IDNA2008. Anyone have links to info giving 
a detailed explanation/tabulation of allowed and non allowed Unicode chars for 
Swift Variable and Constant names?

André Schappo



___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-04 Thread Shawn Steele
The BOM I've seen (not FFFE though), it's prevalence depends on the system and 
other factors. 

The others I only see if there's corruption, bugs, or tests.  The most common 
error I see that causes those is when some developer calls a binary blob a 
unicode string and tries to shove it through a text transport or something.  
Usually that bites them sooner or later.

-Shawn

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Doug Ewell
Sent: Wednesday, June 4, 2014 11:01 AM
To: unicode@unicode.org
Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

How common is it to see any of the following in real-world Unicode text, as 
opposed to code charts and test suites and the like?

1. Unpaired surrogates
2. Noncharacters (besides CLDR data)
3. U+FEFF at the beginning of a stream (note: not packet or arbitrary cutoff 
point)

I'm not asking whether any of these are recommended or prohibited or whether 
they are a good idea. I'm asking about actual usage.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode



RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
That’s exactly what I think should be clarified.  A cooperating system 
of apps should likely use some other markup; however, if they want to use a 
noncharacter to say “OK to insert ad here” (or whatever), that’s up to them.

I fear that the current wording says “Because you might have a cooperating 
system of apps that all agree that a given noncharacter means ‘OK to insert ad 
here’, you may as well emit noncharacters all the time just in case some other 
app happens to use the same sentinel”.

The “problem” is now that previously these characters were illegal, so my 
application didn’t have to explicitly remove them when importing external stuff 
because they weren’t allowed to be there.  With the wording of the corrigendum, 
the onus is on every app importing data to filter out these code points because 
they are “suddenly” legal in foreign data streams.

That is a breaking change for applications, and, worse, it isn’t in the control 
of the applications that take advantage of the newly laxer wording, but rather 
all the other applications on the planet, which may have been stable for years.

My interpretation of “interchanged” was “interchanged outside of a system that 
understood your private use of the noncharacters”.  I can see where that may 
not have been everyone’s interpretation, and maybe should be updated.  My 
interpretation of what you’re saying below is “sentinel values with a private 
meaning can be exchanged between apps”, which is what the PUA’s for.

I don’t mind at all if the definition is loosened somewhat, but if we’re 
turning them into PUA characters we should just turn them into PUA characters.

-Shawn

From: mark.edward.da...@gmail.com [mailto:mark.edward.da...@gmail.com] On 
Behalf Of Mark Davis ??
Sent: Monday, June 2, 2014 9:08 AM
To: Shawn Steele
Cc: Markus Scherer; Doug Ewell; Unicode Mailing List
Subject: Re: Corrigendum #9

The problem is where to draw the line. In today's world, what's an app? You may 
have a cooperating system of apps, where it is perfectly reasonable to 
interchange sentinel values (for example).

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where 
we should make it clearer.)


Mark <https://google.com/+MarkDavis>

— Il meglio è l’inimico del bene —

On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele shawn.ste...@microsoft.com wrote:
I also think that the verbiage swung too far the other way.  Sure, I might need 
to save or transmit a file to talk to myself later, but apps should be strongly 
discouraged from using these for interchange with other apps.

Interchange bugs are why nearly any news web site ends up with at least a few 
articles with mangled apostrophes or whatever (because of encoding 
differences).  Should authors’ tools or feeds or databases or whatever start 
emitting non-characters from internal use, then we’re going to have ugly leaks 
into text “everywhere”.

So I’d prefer to see text that better permitted interchange with other 
components of an application’s internal system or partner system, yet 
discouraged use for interchange with “foreign” apps.

-Shawn



___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
  I agree with Markus; I think the FAQ is pretty clear. (And if not, 
  that's where we should make it clearer.)

 But the formal wording of the standard should reflect that clarity, right?

I don't tend to read the FAQ :)

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
To further my understanding, can someone provide examples of how these are used 
in actual practice?  I can't think of any offhand and the closest I get is like 
the old escape characters to get a dot matrix printer to shift modes, or old 
word processor internal formatting sequences.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
 Oh, look. My mail system converted those nice noncharacters into U+FFFD.
 Was that compliant? Did I deserve what I got? Are those two different 
 questions?

I think I just got spaces.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
Hmm, I find that disconcerting.  I’d prefer a real Unicode character with 
special weights if that concept’s needed.  And I guess that goes a long ways to 
explaining the interchange problem since clearly the code editor’s going to 
need these ☹

From: Markus Scherer [mailto:markus@gmail.com]
Sent: Monday, June 2, 2014 10:17 AM
To: Shawn Steele
Cc: Asmus Freytag; Doug Ewell; Mark Davis ☕️; Unicode Mailing List
Subject: Re: Corrigendum #9

On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele shawn.ste...@microsoft.com wrote:
To further my understanding, can someone provide examples of how these are used 
in actual practice?

CLDR collation data defines special contraction mappings that start with a 
noncharacter, for 
http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers

In CLDR 23 and before (when we were still using XML collation syntax), these 
were raw noncharacters in the .xml files.

As I said earlier:
it should be ok to include noncharacters in CLDR data files for processing by 
CLDR implementations, and it should be possible to edit and diff and 
version-control and web-view those files etc.

markus
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
>  I can't shake the suspicion that Corrigendum #9 is not actually solving a 
general problem, but is a special favor to CLDR as being run by insiders, and 
in the process muddying the waters for everyone else

I think we could generalize to other scenarios so it wasn’t necessarily an 
insider scenario.  For example, I could have a string manipulation library that 
used FFFE to indicate the beginning of an identifier for a localizable 
sentence, terminated by FFFF.  Any system using FFFEid1234FFFF would likely 
expect to be able to read the tokens in their favorite code editor.
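
Just to make the scenario concrete, a tiny Python sketch of that kind of 
sentinel-wrapped token (the FFFE/FFFF delimiters and the helper names are 
illustrative, not from any spec):

    # Illustrative only: wrap and extract identifiers using the noncharacters
    # U+FFFE / U+FFFF as start/end sentinels, as in the scenario above.
    START, END = "\uFFFE", "\uFFFF"

    def wrap(identifier):
        return START + identifier + END

    def extract(text):
        tokens, pos = [], 0
        while True:
            start = text.find(START, pos)
            if start == -1:
                break
            end = text.find(END, start + 1)
            if end == -1:
                break
            tokens.append(text[start + 1:end])
            pos = end + 1
        return tokens

    print(extract("Hello " + wrap("id1234") + " world"))   # ['id1234']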

But I’m concerned that these “conflict” with each other, and embedding the 
behavior in major programming languages doesn’t smell to me like “internal” 
use.  Clearly if I wanted to use that library in a CLDR-aware app, there is a 
potential risk for a conflict.

In the CLDR case, there *IS* a special relationship with Unicode, and perhaps 
it would be warranted to explicitly encode character(s) with the necessary 
meaning(s) to handle edge-case collation scenarios.

-Shawn
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Best practice of using regex to identify non-ASCII email addresses

2013-10-30 Thread Shawn Steele
EAI doesn't really specify anything more than the older SMTP about validating 
email addresses.  Everything in the local part >= U+0080 is permissible and up 
to the server to sort out what characters it wants to allow, how it wants to 
map things like Turkish I, etc.  Some code points are clearly really unhelpful 
in an email local part, but the EAI RFCs leave it up to the servers how they 
want to assign mailboxes.

Obviously you could check the domain name to make sure it's a valid domain 
name, and the ASCII range of the local part to make sure it respects the 
earlier RFCs, and the lengths, but you won't really know if it's a legal name 
until the mail does/doesn't get accepted by the server.  AFAIK there isn't a 
published regex for doing the limited validation that is possible.
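
To give a feel for how limited that validation is, here's a rough Python sketch 
(my own approximation, not a published regex): it checks the overall shape, the 
commonly cited length limits, an RFC 5322-ish set of ASCII characters plus any 
non-ASCII in the local part, and IDNA-encodability of the domain, and nothing 
more.

    import re

    # Rough structural check only; real acceptance is up to the receiving server.
    # ASCII characters in the local part are limited to RFC 5322 "atext" plus dot;
    # any non-ASCII character is allowed, per EAI.
    LOCAL_RE = re.compile(r"^[A-Za-z0-9!#$%&'*+/=?^_`{|}~.\u0080-\U0010FFFF-]+$")

    def looks_like_eai_address(addr):
        local, sep, domain = addr.rpartition("@")
        if not sep or not local or not domain:
            return False
        if len(addr.encode("utf-8")) > 254 or len(local.encode("utf-8")) > 64:
            return False
        if not LOCAL_RE.match(local):
            return False
        try:
            domain.encode("idna")   # stdlib IDNA 2003 codec; good enough for a sketch
        except UnicodeError:
            return False
        return True

    print(looks_like_eai_address("渡辺@example.jp"))   # True
    print(looks_like_eai_address("no-at-sign"))        # False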

-Shawn

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of James Lin
Sent: Wednesday, October 30, 2013 1:42 PM
To: cldr-us...@unicode.org; unicode@unicode.org
Subject: Re: Best practice of using regex to identify non-ASCII email addresses

Let me include the unicode alias as well for wider audience since this topic 
came up few times in the past.

From: James Lin james_...@symantec.com
Date: Wednesday, October 30, 2013 at 1:11 PM
To: cldr-us...@unicode.org
Subject: Best practice of using regex to identify non-ASCII email addresses

Hi
does anyone have a best practice or guideline on how to validate non-ASCII 
email addresses by using regular expressions?

I looked through RFC 6531 and the CLDR repository, and nothing has a solid 
example of how to validate a non-ASCII email address.

thanks everyone.
-James


RE: Best practice of using regex to identify non-ASCII email addresses

2013-10-30 Thread Shawn Steele
Mixed-script considerations are all supposed to be handled by the mailbox 
administrator.  It's perfectly valid for a domain to assign Latin addresses and 
also Cyrillic ones.  Indeed for Cyrillic EAI, one probably would almost 
certainly require ASCII (eg: Latin) aliases during whatever the transition 
period is.

A German mailbox admin may only allow German letters and no other Latin 
characters in their mailbox names.  Other admins may want to allow Latin 
characters with other scripts (CJK locales come to mind).  And a Russian admin 
may provide all-Cyrillic mailboxes with all-Latin aliases to those names.  
(Hopefully that admin's being careful about homographs, but the standards still 
let the admin make the decisions).

The PUA isn't even forbidden (I'm hoping for a pIqaD alias some day).

-Shawn

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of James Lin
Sent: Wednesday, October 30, 2013 2:58 PM
To: Paweł Dyda
Cc: cldr-us...@unicode.org; unicode@unicode.org
Subject: Re: Best practice of using regex to identify non-ASCII email addresses

Hi
I am not expecting a single regular expression to solve all possible 
combinations of scripts.  What I am looking for (which may not be possible due 
to combinations of scripts and mixed scripts) is probably somewhere along the 
lines of having individual scripts validated by regular expression.  I am still 
wondering if it is possible to have a regular expression for individual scripts 
only, and not mixed-and-matched (for the time being), such as (I am being very 
high level here):

  *Phags-pa scripts

 *   Chinese: Traditional/Simplified
 *   Mongolian
 *   Sanskrit
 *   ...

  *   Kana scripts

 *   Japanese: hirakana/Katakana
 *   ...

  *   Hebrew scripts

 *   Yiddish
 *   Hebrew
 *   Bukhori
 *   ...

  *   Latin scripts

 *   English
 *   Italian
 *   

  *   Hangul scripts

 *   Korean

  *   Cyrillic Scripts

 *   Russian
 *   Bulgarian
 *   Ukrainian
 *   ...
By focusing on each script to derive a regular expression, I was wondering if 
such validation can be accomplished here.

Of course, RFC 3696 standardizes the email formatting rules, and we can use 
those rules to validate the format before checking the scripts for validity.
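
As a sketch of the per-script idea (using the third-party "regex" package, 
which understands Unicode Script properties; the policies below are only 
examples of the kind of list above, not recommendations):

    import regex

    # Each policy allows one script (or an explicit union) plus Common characters.
    SCRIPT_PATTERNS = {
        "Cyrillic": regex.compile(r"[\p{Script=Cyrillic}\p{Script=Common}]+"),
        "Hangul":   regex.compile(r"[\p{Script=Hangul}\p{Script=Common}]+"),
        # Japanese legitimately mixes scripts, so it needs an explicit union:
        "Japanese": regex.compile(
            r"[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}"
            r"\p{Script=Latin}\p{Script=Common}]+"
        ),
    }

    def matches_policy(local_part, policy):
        return SCRIPT_PATTERNS[policy].fullmatch(local_part) is not None

    print(matches_policy("иван.петров", "Cyrillic"))   # True
    print(matches_policy("ivanпетров", "Cyrillic"))    # False (Latin + Cyrillic mix)
    print(matches_policy("山田taro", "Japanese"))       # True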

Warm Regards,
-James Lin



From: Paweł Dyda pawel.d...@gmail.com
Date: Wednesday, October 30, 2013 at 2:19 PM
To: James Lin james_...@symantec.com
Cc: cldr-us...@unicode.org, Unicode List unicode@unicode.org
Subject: Re: Best practice of using regex to identify non-ASCII email addresses

Hi James,
I am not sure if you have seen my email, but... I believe Regular Expressions 
are not a valid tool for that job (that is validating Int'l email address 
format).

In the internal email I especially gave one specific example, where to my 
knowledge it is (nearly) impossible to use Regular Expression to validate email 
address.

The reason I gave was mixed-script scenario.

How can we ensure that we allow mixture of  Hiragana, Katakana and Latin, while 
basically disallowing any other combinations with Latin (especially Latin + 
Cyrillic or Latin + Greek)?
I am really curious to know...
And of course there are several single-script (homographs and alike) attacks 
that we might want to prevent. I don't think it is even remotely possible with 
Regular Expressions. Please correct me if I am wrong.
Cheers,
Paweł.

2013/10/30 James Lin james_...@symantec.com
Let me include the unicode alias as well for wider audience since this topic 
came up few times in the past.

From: James Lin james_...@symantec.com
Date: Wednesday, October 30, 2013 at 1:11 PM
To: cldr-us...@unicode.org
Subject: Best practice of using regex to identify non-ASCII email addresses

Hi
does anyone have a best practice or guideline on how to validate non-ASCII 
email addresses by using regular expressions?

I looked through RFC 6531 and the CLDR repository, and nothing has a solid 
example of how to validate a non-ASCII email address.

thanks everyone.
-James



RE: Best practice of using regex to identify non-ASCII email addresses

2013-10-30 Thread Shawn Steele
For EAI (the question being asked), the entire address, local part and domain, 
are encoded in UTF-8.

-Shawn

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Philippe Verdy
Sent: Wednesday, October 30, 2013 4:08 PM
To: James Lin
Cc: Paweł Dyda; cldr-us...@unicode.org; unicode@unicode.org
Subject: Re: Best practice of using regex to identify non-ASCII email addresses

You should not attempt to detect scripts, or even assume that they are encoded 
based on Unicode, in the username part; all you can do is break at the first @ 
to split it between the user name part and the domain name, then use the IDN 
specs to validate the domain name part.

* 1. Domain name part:

You may want to restrict only to internet domains (which must contain a dot 
before the TLD), and validate the TLD label against a list that you do not 
restrict to local usage only (such as .local or .localnet), or only to your own 
domain, but I suggest that you validate all these domains only by performing an 
MX request on your DNS server (this could take time to reply, unless you just 
check the TLD part, which should be cached most often, or use the DNS request 
only for domains not in a well-known list of gTLDs, plus all 2-letter ccTLDs 
which are not in the private-use range of ISO 3166-1).

Note that to send a mail, you need an MX resolution on DNS to get the address of 
a mail server, but that does not mean it will be immediately and constantly 
reachable: the IP you get may be temporarily unreachable (due to your ISP or 
local routing problems, or because the remote mail server is temporarily offline 
or overloaded). Performing an MX request, however, is much faster than trying to 
send a mail to it, because MX resolution will use your local DNS server cache 
and the caches of upstream DNS servers of your ISP (you normally don't need to 
perform authoritative MX requests, which require a recursive search from the 
root, bypassing all caches and the scalability of the DNS system, so it's not 
a good policy to do that by default).

If you need security, authoritative DNS queries should be replaced by secure 
emails based on direct authentication with the mail server at the start of the 
SMTP session. Authoritative DNS queries should be performed only if this 
authentication fails (in order to bypass incorrect data in DNS caches), but not 
automatically (the failure could be caused by problems on your own site), so 
keep these unchecked email addresses pending in your database (the problem may 
be solved without doing anything when your server retries several minutes or 
hours later, once it has succeeded in sending the validation email to your 
subscribers).

Do not insert into your database any email address coming from a source you 
don't trust to have received the approval of the mail address owner, or that 
does not obey the same explicit approval policy seen by that user, or that is 
not in a domain under your own control; otherwise you risk being flagged as a 
spammer and having your site blocked on various mail servers: you need to send 
the validation email without any other kind of advertising, except your own 
identity.

Note that instead of a domain, you *may* accept a host name with an IPv4 
address (in dotted decimal format), or an IPv6 address (within [brackets], and 
in hexadecimal with colons), or some other host name formats for specific 
mail/messaging transport protocols you accept, for example 
username@[irc:ircservername:port:channelname], or username@{uuid} using 
other punctuation not valid in domain names.


* 2. User name part:

There's no standard encoding there.

- Do not assume any encoding (unless you know the encoding used on each 
specific domain!). This part never obeys IDNA.
- Every unrestricted byte in the printable 7-bit ASCII range, and all bytes in 
0x80..0xFF, are valid in any sequence.
- Only a few ASCII punctuation characters need to be checked according to the 
RFCs.
- Never canonicalise user names by forcing capitalisation (not even for the 
basic Latin letters: user names could be encoded with Base64, for example, 
where letter case is significant), even if you can do it for the domain name 
part.
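
A minimal Python sketch of this split-and-validate flow, splitting at the last 
@ here, validating only the domain with the stdlib "idna" codec (which 
implements IDNA 2003), and leaving the user name part untouched; an MX lookup 
would need a DNS library and is left out:

    def split_and_check(address):
        local, sep, domain = address.rpartition("@")
        if not sep:
            raise ValueError("no @ in address")
        if domain.startswith("[") and domain.endswith("]"):
            ok = True                    # literal address form; not validated here
        else:
            try:
                domain.encode("idna")    # raises UnicodeError on a bad label
                ok = "." in domain       # require at least one dot (internet domain)
            except UnicodeError:
                ok = False
        return local, domain, ok

    print(split_and_check("someone@bücher.example"))   # IDNA-encodable -> ok=True
    print(split_and_check("someone@bad..example"))     # empty label -> ok=False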




2013/10/30 James Lin james_...@symantec.com
Hi
I am not expecting a single regular expression to solve all possible 
combinations of scripts.  What I am looking for (which may not be possible due 
to combinations of scripts and mixed scripts) is probably somewhere along the 
lines of having individual scripts validated by regular expression.  I am still 
wondering if it is possible to have a regular expression for individual scripts 
only, and not mixed-and-matched (for the time being), such as (I am being very 
high level here):

  *Phags-pa scripts

 *   Chinese: Traditional/Simplified
 *   Mongolian
 *   Sanskrit
 *   ...

  *   Kana scripts

 *   Japanese: hirakana/Katakana
 *   ...

  *   Hebrew scripts

 *   Yiddish
 *  

RE: Terminology question re ASCII

2013-10-29 Thread Shawn Steele
I would concur.  When I hear “8 bit ASCII” the context is usually confusing the 
term with any of what we call “ANSI Code Pages” in Windows.  (or similar ideas 
on other systems).

It’s also usually the prelude to a conversation asking the requestor to back up 
5 or 6 steps and explain what they’re really trying to do because something’s 
probably a bit confused.

-Shawn

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Philippe Verdy
Sent: Tuesday, October 29, 2013 7:49 AM
To: Mark Davis ☕
Cc: Donald Z. Osborn; unicode
Subject: Re: Terminology question re ASCII

8-bit ASCII is not so clear!

The reason for that is the historic documentation of much software, notably 
for the BASIC language, or similar tools like Excel, or even more recent 
languages like PHP, offering functions like CHR$(number) and ASC(string) to 
convert a string to the numeric 8-bit "ASCII" code of its first character or 
the reverse. The effective encoding of strings was in fact not specified at all 
and could be any 8-bit encoding used on the platform.

Only in more recent versions of implementations of these languages do they 
specify that the encoding of their strings is now based on Unicode (most often 
UTF-16, so that 8-bit values now produce the same result as ISO-8859-1), but 
this is not enforced if a compatibility mode was kept (e.g. in PHP, which still 
uses unspecified 8-bit encodings for its strings in most of its API, or in 
Python, which distinguishes types for 8-bit encoded strings and Unicode-encoded 
strings).


2013/10/29 Mark Davis ☕ m...@macchiato.com
Normally the term ASCII just refers to the 7-bit form. What is sometimes called 
8-bit ASCII is the same as ISO Latin 1. If you want to be completely clear, 
you can say 7-bit ASCII.


Mark <https://plus.google.com/114199149796022210033>

— Il meglio è l’inimico del bene —

On Tue, Oct 29, 2013 at 5:12 AM, d...@bisharat.net wrote:
Quick question on terminology use concerning a legacy encoding:

If one refers to "plain ASCII", or "plain ASCII text" or "... characters", 
should this be taken strictly as referring to the 7-bit basic characters, or 
might it encompass characters that might appear in an 8-bit character set (per 
the so-called "extended ASCII")?

I've always used the term "ASCII" in the 7-bit, 128-character sense, and 
modifying it with "plain" seems to reinforce that sense. (Although "plain text" 
in my understanding actually refers to lack of formatting.)

Reason for asking is encountering a reference to "plain ASCII" describing text 
that clearly (by presence of accented characters) would be 8-bit.

The context is one of many situations where in attaching a document to an 
email, it is advisable to include an unformatted text version of the document 
in the body of the email. Never mind that the latter is probably in UTF-8 
anyway(?) - the issue here is the terminology.

TIA for any feedback.

Don Osborn

Sent via BlackBerry by AT&T





RE: Bing now translates to/from Klingon

2013-05-17 Thread Shawn Steele
Hey, I know the guy that made that font!!!

http://www.bing.com/translator/?from=en&to=tlh-qon&text=Hello%20Unicode 

-Shawn

 
http://blogs.msdn.com/shawnste

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Karl Williamson
Sent: Friday, May 17, 2013 1:44 PM
To: unicode@unicode.org
Subject: Bing now translates to/from Klingon

http://www.bing.com/translator







RE: Bing now translates to/from Klingon

2013-05-17 Thread Shawn Steele
(Does this help the # of documents needing pIqaD to get it encoded?  Since now 
there's a bajillion that can be in pIqaD?)

http://www.nbcnews.com/technology/put-down-batleth-try-klingon-english-translator-1C9925541
 

-Original Message-
From: Shawn Steele 
Sent: Friday, May 17, 2013 2:06 PM
To: 'Karl Williamson'; unicode@unicode.org
Subject: RE: Bing now translates to/from Klingon

Hey, I know the guy that made that font!!!

http://www.bing.com/translator/?from=en&to=tlh-qon&text=Hello%20Unicode  

-Shawn

 
http://blogs.msdn.com/shawnste

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Karl Williamson
Sent: Friday, May 17, 2013 1:44 PM
To: unicode@unicode.org
Subject: Bing now translates to/from Klingon

http://www.bing.com/translator







RE: If Unicode wants to show the Red Card to someone ...

2013-04-01 Thread Shawn Steele
In the same spirit, if the proposed U+1F54F *were* encoded, then it might be 
easier to respond to the proposal in plain-text mail.

-Shawn

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Karl Pentzlin
Sent: Monday, April 1, 2013 2:52 AM
To: unicode@unicode.org
Subject: If Unicode wants to show the Red Card to someone ...

In the tradition of today's date to present proposals with a somewhat more 
entertaining subject than usual, there is:
Proposal to encode symbols for penalty cards in the UCS.
Until you find it on the usual lists, you can see it at:
http://www.acssoft.de/PenaltyV1.pdf

- Karl









RE: Are there any pre-Unicode 5.2 applications still in existence?

2013-03-08 Thread Shawn Steele
I think you can safely assume that apps exist that are not well behaved.

For this type of security problem, I always recommend validating strings after 
any possible transformations occur.  Any sort of conversion could be a problem. 
 Normally I talk about this in a "convert from a non-Unicode code page to 
Unicode" context, e.g.: make sure you validate AFTER the conversion, but the 
concept applies most any time.
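
A tiny Python sketch of the "validate AFTER the conversion" point (the 
disallowed-category rule below is just a placeholder for whatever the real 
check is):

    import unicodedata

    def decode_then_validate(raw, codepage):
        text = raw.decode(codepage)              # transformation first
        for ch in text:                          # validation second
            if unicodedata.category(ch) in ("Cc", "Cf", "Cn", "Co", "Cs"):
                raise ValueError("disallowed character U+%04X" % ord(ch))
        return text

    # 0x81 0x40 is the Shift-JIS ideographic space; a byte-level check done
    # before the conversion wouldn't see the character it becomes.
    print(repr(decode_then_validate(b"\x81\x40abc", "shift_jis")))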

Unfortunately many apps do strange things.

-Shawn

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Costello, Roger L.
Sent: Friday, March 8, 2013 7:55 AM
To: unicode@unicode.org
Subject: Are there any pre-Unicode 5.2 applications still in existence?

Hi Folks,

I have learned that:

In some versions prior to Unicode 5.2, conformance clause C7
allowed the deletion of noncharacter code points [1]

Are there still in existence applications which delete noncharacter code points 
from strings?

Are there any pre-Unicode 5.2 applications still in existence?

The paper at [1] describes the security risk with deleting noncharacter code 
points. Is this risk still a concern, or can one assume that there are no more 
applications which delete noncharacter code points?

/Roger

[1] http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters









RE: pIqaD in actual use

2013-02-20 Thread Shawn Steele
There're examples of pIqaD in the PUA in this DL's archives.  Not quite sure 
how that got there.

-Shawn
 
http://blogs.msdn.com/shawnste

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of David Starner
Sent: ,  ,   :
To: Unicode Mailing List
Subject: pIqaD in actual use

According to Wikipedia:

In September 2011, Eurotalk released the Learn Klingon course in its Talk 
Now! range of over 130 languages and includes a choice of more than 120 
languages to learn from just by changing the help language.
The course is broken down into topics and made up of practice and learning 
games as well as the ability to test your skills with the speech recognition 
software. The language is displayed in both Latin and pIqaD fonts making this 
the first language course written in pIqaD and approved by CBS and Marc Okrand. 
It was translated by Jonathan Brown and Okrand and uses the Hol-pIqaD TrueType 
font.

That should help at least some of the "pIqaD in real use" problems, though not 
the "OMG! Klingon" problems.

--
Kie ekzistas vivo, ekzistas espero.






RE: Long-term archiving of electronic text documents

2013-01-28 Thread Shawn Steele
 UTF-256 allows each hex digit of UTF-32 to be expressed as an ASCII hex digit 
 (characters 0-9 and A-F encoded as bytes 0x30-0x39 and 0x41-0x46).

In my experience, I lose an entire block of a disk, or track, or drive, so 
redundancy at the character level isn’t likely to be very helpful; you’d need a 
minimum of 2 blocks/character following that logic.  Fortunately you did 
mention the scalability of UTF-256.

Historically, my biggest challenge with electronic data over time is being able 
to read the file… Nothing’s really “plain text”, so formats (and media) evolve 
and change.  Reading/converting my old C64 or Amiga stuff is a bit difficult 
these days.

-Shawn



RE: Normalization rate on the Web

2013-01-21 Thread Shawn Steele
I have no idea what the stats are; however, some systems generate more NFC and 
others more NFD.  And then some publisher uses an NFC system but an author uses 
an NFD system, so the pages served end up with a mixture.

I generally recommend using comparisons and index keys that understand NFC/NFD 
and compare accurately regardless of the form.
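
A small Python sketch of that recommendation: pick one form (NFC here) and 
apply it both when comparing and when building index keys.

    import unicodedata

    def norm_key(s):
        return unicodedata.normalize("NFC", s)

    def norm_equal(a, b):
        return norm_key(a) == norm_key(b)

    composed   = "café"              # precomposed U+00E9
    decomposed = "cafe\u0301"        # e + combining acute
    print(composed == decomposed)              # False: raw code point comparison
    print(norm_equal(composed, decomposed))    # True

    index = {norm_key(composed): "some page"}
    print(index[norm_key(decomposed)])         # found via the normalized key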

-Shawn

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Denis Jacquerye
Sent: Monday, January 21, 2013 8:12 AM
To: Unicode Discussion
Subject: Normalization rate on the Web

Does anybody have any idea of how much of the Web is normalized in NFC or NFD? 
Or how much not normalized?

How would one find out or try to make a smart guess?

I know a lot of library catalogue data is in NFD or somewhat decomposed. Is 
there any other field that heavily uses decomposition?

--
Denis Moyogo Jacquerye
African Network for Localisation http://www.africanlocalisation.net/
Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
DejaVu fonts --- http://www.dejavu-fonts.org/








RE: locale-aware string comparisons

2013-01-02 Thread Shawn Steele
I'd try to avoid creating a dependency where case mapping needs to behave the 
same as case-insensitive comparison.

I'd either always case fold and then compare, or always compare 
case-insensitively.
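
For example (Python; full case folding maps ß to "ss", which a lower()/lower() 
comparison misses):

    def fold_equal(a, b):
        return a.casefold() == b.casefold()

    print("straße".lower() == "STRASSE".lower())   # False
    print(fold_equal("straße", "STRASSE"))         # True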

-Shawn

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of James Cloos
Sent: Tuesday, January 1, 2013 5:43 PM
To: Mark Davis ☕
Cc: Whistler, Ken; unicode@unicode.org
Subject: Re: locale-aware string comparisons

 MD == Mark Davis ☕ m...@macchiato.com writes:

MD All of these are different, all of them still have over 200 
MD differences from either compare(lower(x),lower(y)) or compare(upper
MD (x),upper(y))

What about, then:

  compare(lower(x),lower(y)) || compare(upper(x),upper(y))

Or, to emphasize that I mentioned C only as a pseudocode, akin to SQL:

  LOWER(x) LIKE LOWER(y) OR UPPER(x) LIKE UPPER(y)

Would that cover all of the outliers?

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6







RE: data for cp1252

2012-12-07 Thread Shawn Steele
I'm not sure what you expect :)  You've found consistent behavior, and it 
matches the behavior documented in 
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

IANA http://www.iana.org/assignments/charset-reg/windows-1252 points to both 
the non-best-fit standard mapping and the best fit mapping.

-Shawn

From: Buck Golemon [mailto:b...@yelp.com]
Sent: Friday, December 7, 2012 3:03 PM
To: unicode; Shawn Steele
Subject: Re: data for cp1252

One week: bump.

On Wed, Nov 28, 2012 at 10:52 AM, Buck Golemon b...@yelp.com wrote:
Shawn, can I get your comments?
Does this seem irrelevant?

On Mon, Nov 26, 2012 at 5:05 PM, Buck Golemon b...@yelp.com wrote:
I've compiled cross-browser data on the question of how cp1252 decodes the 
byte 0x81.

http://bukzor.github.com/encodings/cp1252.html

In summary, all browsers agree that it decodes to U+81. Opera initially thought 
it was undefined, but changed their mind in version 12 (the current version).




RE: data for cp1252

2012-12-07 Thread Shawn Steele
 In contrast, bringing the cp1252 definition into line with real 
 implementations and recommending UTF-8 for new developments are not mutually 
 exclusive.

Exactly?

If you already have existing data in 1252 or a variation (and can't tell them 
apart), then nothing's gained by making NEW requirements for 1252 which the old 
data won't conform to.  Changing standards or behavior will only break things 
that already work.

If you're creating new data, it should be using UTF-8 to avoid these kinds of 
ambiguity.

-Shawn

On Fri, Dec 7, 2012 at 4:41 PM, Shawn Steele shawn.ste...@microsoft.com wrote:
It's a variation.  The undefined codepoints in 1252 probably shouldn't be used, 
and I can't imagine that adding a code page helps anything, nor that changing 
an existing behavior helps anything.  People really should be using UTF-8.

-Shawn

From: Buck Golemon [mailto:b...@yelp.com]
Sent: Friday, December 7, 2012 4:34 PM
To: Shawn Steele
Cc: unicode

Subject: Re: data for cp1252

I've been told that bestfit1252 wasn't meant to redefine the cp1252 mapping, 
although its first line declares CODEPAGE 1252.

Is it a separate encoding or not?

If so, I'll submit a new bestfit1252 to the python stdlib.
If not, I believe the cp1252 mapping needs to be brought into line.


On Fri, Dec 7, 2012 at 4:27 PM, Shawn Steele shawn.ste...@microsoft.com wrote:
☺




RE: cp1252 decoder implementation

2012-11-24 Thread Shawn Steele
 No-one would be more happy than me if we could just ditch all the legacy 
 encodings and all switch to Unicode everywhere, but that will never happen. 
 There is enough legacy content out there that will never be converted.

That's sort of exactly the point: 

*NEW* content should be UTF-8 (or UTF-16) because everyone's learned how nasty 
encodings are.

*LEGACY* content is playing by whatever old rules it was using when it was 
created.  You can't fix that by updating or changing the standards that it 
might have been correctly or incorrectly depending on.  All that does is add 
more ambiguity to the existing content.

- Shawn




RE: cp1252 decoder implementation

2012-11-24 Thread Shawn Steele
Um, that one was really confused and didn't really work.  (IIRC it didn't round 
trip to the right encoding in some cases, and itself was causing some nasty 
compatibility problems before the tweak to the name.  Also, we still recognize 
the old bizarre name).

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Masatoshi Kimura
Sent: Wednesday, November 21, 2012 12:28 PM
To: unicode@unicode.org
Subject: Re: cp1252 decoder implementation

(2012/11/22 1:58), Shawn Steele wrote:
 We aren’t going to change names (since that’ll break anyone already using 
 them), we probably won’t recognize new names (since anyone trying to 
 use a new name wouldn’t work on millions of existing computers, so no 
 one would add it).
Hey, why did Microsoft change unicodeFFFE to unicodeFEFF?
What's the benefit of sacrificing the backward compatibility?
According to Michael S. Kaplan, once some Microsoft people (including
you) said that they wouldn't change it because of compatibility.
https://blogs.msdn.com/b/michkap/archive/2005/09/11/463444.aspx?Redirected=true
I admit you have a valid point, but why don't you do what you say?
--
vyv03...@nifty.ne.jp







RE: cp1252 decoder implementation

2012-11-21 Thread Shawn Steele
I’ll be more definitive than Murray ☺  Our legacy code pages aren’t going to 
change.  We won’t add more characters to 1252.  We won’t add new code pages.  
We aren’t going to change names (since that’ll break anyone already using them), 
we probably won’t recognize new names (since anyone trying to use a new name 
wouldn’t work on millions of existing computers, so no one would add it).

The churn is too painful for customers.  If there’s a new character that 
everyone “must” use, we’ll point them at UTF-8 or UTF-16.  Any request to 
change codepage behavior would have to meet a very high bar.

The status of these 5 characters is already in the best fit mappings document 
pointed to by the IANA registry entry for windows-1252, which is as strong as 
I’m willing to go for them.

The last thing I did WRT to code page standards was to ask for the best fit 
mappings to be posted so that the IANA charset registry would have something to 
reference to clarify the existing names.  It’s possible (if I find the time) 
that a few of the IANA charset entries could be updated to emphasize that some 
common names have differing implementations by different vendors/OS’s such as 
was done for shift_jis http://www.iana.org/assignments/charset-reg/shift_jis or 
the updates to point out the best fit mapping for 1252 at 
http://www.iana.org/assignments/charset-reg/windows-1252  In other words, the 
trend is to clarify that there are variations in behavior, and to please use 
Unicode.

Also see:
http://blogs.msdn.com/b/shawnste/archive/2007/09/24/are-we-going-to-update-or-maintain-the-best-fit-or-code-page-mappings.aspx
http://blogs.msdn.com/b/shawnste/archive/2008/01/17/code-pages-and-security-issues.aspx
http://blogs.msdn.com/b/shawnste/archive/2007/03/20/some-reasons-to-make-your-application-unicode.aspx

(and 
http://blogs.msdn.com/b/shawnste/archive/2012/06/16/building-the-lego-disney-wonder.aspx
 just because I think it’s cool)

I can see why HTML5 might think windows-1252 support is a good idea, but 
personally I’d’ve been happier if it wasn’t a requirement.  Too much code page 
corruption happens on the web, and most of the badly-tagged content probably 
misdeclares itself as 1252.  UTF-8 is a WAY better choice, particularly for the 
characters in the set supported by windows-1252.

-Shawn
( )

SSDE,
Microsoft

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Murray Sargent
Sent: Tuesday, November 20, 2012 8:55 PM
To: verd...@wanadoo.fr; Doug Ewell
Cc: Unicode Mailing List; Buck Golemon
Subject: RE: cp1252 decoder implementation

Philippe commented: “(even if later Microsoft decides to map some other 
characters in its own windows-1252 charset, like it did several times and 
notably when the Euro symbol was mapped)”.

Personal opinion, but I’d be very surprised if Microsoft ever changed the 1252 
charset. The euro was added back in 1999 when code pages were still used a lot. 
Code pages in general are pretty much irrelevant today except for reading 
legacy documents. They are virtually never used internally in modern software. 
UTF-8,UTF-16, and UTF-32 are what are used these days.

(But code pages do have the advantage that they are associated with specific 
character repertoires, which amounts to a great hint for font binding…)

Murray


RE: cp1252 decoder implementation

2012-11-18 Thread Shawn Steele
 What effort has been spent? This is not an either/or type of proposition.
 If we can agree that it's an improvement (albeit small), let's update the 
 mapping.
 Is it much harder than I believe it is?

What if some application's treating it as undefined?  And now the code page 
gets updated to say that it's a real mapping?  Then someone uses the code point 
and causes the application to break, and then they point to the updated 
standard, and say they aren't compliant.

 Internally the app is fully utf8, but must accept (poorly encoded) input from 
 all over the web.

IMO, it's better to get that poorly encoded input to be correctly encoded.

A) If it really means CP 1252, it shouldn't really be using these code 
points, so defining these differently doesn't really solve anything.

B)  If the input isn't working right because of this, then something's 
wrong with the input, so they need to fix that.

I don't think it's worth the app developer's time, or this list's time, trying 
to fix something that's such a severe edge case.

 cp1252 is one of the two encodings that a browser *must* implement, according 
 to the html5 spec, so this is a very important encoding, second only to utf8.

If HTML 5 requires it because it's so common, then changing the definition of 
the behavior doesn't seem like a great idea.

 My essential point is that the latin1 mapping file specifies an encoding that 
 will succeed with arbitrary binary input.

Ah, but this is all about text, not arbitrary binary input.  Those 5 code 
points provide no value for text.  They aren't used, shouldn't be used, and 
aren't very useful even if they were used.  Expecting binary input to conform 
to a text encoding isn't a good idea.

By that logic, one would expect UTF-8 to accept arbitrary binary input.  
However 0x80, 0x80 needs to fail according to the standards, so even UTF-8 
can't accept arbitrary binary input.  If you need to transmit binary data, then 
send it in some non-text or appropriately encoded form.

-Shawn



RE: cp1252 decoder implementation

2012-11-17 Thread Shawn Steele
IMO this isn't worth the effort being spent on it.  MOST encodings have all 
sorts of interesting quirks, variations, OEM or App specific behavior, etc.  
These are a few code points that haven't really caused much confusion, and 
other code pages are much more confusing (like the CJK ones in particular).

I'd be much happier spending effort on getting apps to UTF-8 than trying to 
resolve esoteric quirks of legacy encodings.  Even if you get that CP perfect, 
someone's gonna enter any of a bajillion characters on that page's HTML 5 web 
form that'll turn into ? at best.

-Shawn

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Buck Golemon
Sent: Saturday, November 17, 2012 8:35 AM
To: verd...@wanadoo.fr
Cc: Doug Ewell; unicode
Subject: Re: cp1252 decoder implementation

 So don't say that there are one-for-one equivalences.

I was just quoting this section of the standard: 
http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

 There is a simple, one-to-one mapping between 7-bit (and 8-bit) control codes 
 and the Unicode control codes: every 7-bit (or 8-bit) control code is 
 numerically equal to its corresponding Unicode code point.

A one-to-one equivalency between bytes and unicode-points is exactly what is 
specified here, limited to the domain of 8-bit control codes.

On Fri, Nov 16, 2012 at 9:48 PM, Philippe Verdy verd...@wanadoo.fr wrote:
If you are thinking about byte values you are working at the encoding scheme 
level (in fact another, lower level which defines a protocol presentation layer, 
e.g. transport syntaxes in MIME). Unicode code points are conceptually not an 
encoding scheme, just a coded character set (independent of the encoding 
scheme).

Separate the levels of abstraction and you'll be much better off. Forget the 
apparent homonyms that exist between distinct layers of abstraction and use 
each standard for what it is designed for (including the Unicode 
character/glyph model, which does not define an encoding scheme).

So don't say that there are one-for-one equivalences. This is wrong: the 
adaptation layer must exist between abstraction levels and between separate 
standards, but the Unicode standard does not specify them completely (with the 
only exception of the standard UTF encoding schemes, which are just one possible 
adaptation across some abstraction levels, but are not made to adapt alone to 
standards other than what is in the Unicode standard itself).


2012/11/17 Buck Golemon b...@yelp.com
On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell d...@ewellic.org wrote:
Buck Golemon wrote:
Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
to map it to the equally-non-semantic U+81 ?

This would allow systems that follow the html5 standard and use cp1252
in place of latin1 to continue to be binary-faithful and reversible.

This isn't quite as black-and-white as the question about Latin-1. If you are 
targeting HTML5, you are probably safe in treating an incoming 0x81 (for 
example) as either U+0081 or U+FFFD, or throwing some kind of error.

Why do you make this conditional on targeting HTML5?

To me, replacement and error are out because they mean the system loses data or 
completely fails where it used to succeed.
Currently there's no reasonable way for me to implement the U+0081 option other 
than inventing a new cp1252+latin1 codec, which seems undesirable.
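
One way to get the U+0081 behavior without inventing a whole codec, at least in 
Python (a sketch of mine, not from the thread, and assuming CPython's cp1252 
codec, which raises on the five unassigned bytes), is a custom decode-error 
handler:

    import codecs

    def latin1_control_fallback(exc):
        # For bytes cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D),
        # substitute the numerically equal C1 control code, which is what
        # Latin-1 and the HTML5/WHATWG windows-1252 decoder give you.
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(b) for b in bad), exc.end
        raise exc

    codecs.register_error('latin1-control-fallback', latin1_control_fallback)

    print(repr(b'caf\xe9 \x81'.decode('cp1252', errors='latin1-control-fallback')))
    # 'café \x81' – 0x81 comes through as U+0081 instead of raising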

HTML5 insists that you treat 8859-1 as if it were CP1252, so it no longer 
matters what the byte is in 8859-1.

I feel like you skipped a step. The byte is 0x81, full stop. I agree that it 
doesn't matter how it's defined in latin1 (also, it's not defined in latin1).
The section of the Unicode Standard that says control codes are equal to their 
Unicode characters doesn't mention latin1. Should it?
I was under the impression that it meant any single-byte encoding, since it 
goes out of its way to talk about 8-bit control codes.




RE: cp1252 decoder implementation

2012-11-16 Thread Shawn Steele
People really should be using UTF-8 or something else :)   IMO these are legacy 
encodings and should be deprecated.

-Shawn

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Doug Ewell
Sent: Friday, November 16, 2012 4:11 PM
To: Buck Golemon; unicode
Subject: Re: cp1252 decoder implementation

Buck Golemon wrote:

 Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and 
 to map it to the equally-non-semantic U+81 ?

 This would allow systems that follow the html5 standard and use cp1252 
 in place of latin1 to continue to be binary-faithful and reversible.

This isn't quite as black-and-white as the question about Latin-1. If you are 
targeting HTML5, you are probably safe in treating an incoming
0x81 (for example) as either U+0081 or U+FFFD, or throwing some kind of error. 
HTML5 insists that you treat 8859-1 as if it were CP1252, so it no longer 
matters what the byte is in 8859-1.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell  









RE: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-20 Thread Shawn Steele
A) it can use quoted-printable
B) See RFC 6532/6530 - Now it can be UTF-8 :)
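
A small Python sketch of what option A looks like in practice (mine, not part of 
the thread): the email.header module wraps a non-ASCII Subject in an RFC 2047 
encoded-word so it can cross a 7-bit transport.

    from email.header import Header

    # A non-ASCII Subject is emitted as an RFC 2047 encoded-word
    # (base64 or quoted-printable, whichever the charset prefers).
    print(Header('Résumé — 履歴書', 'utf-8').encode())
    # e.g. =?utf-8?b?...?=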

-Shawn










RE: Flag tags

2012-05-31 Thread Shawn Steele
 We are missing the JOLLY ROGER.

At least one, there're lots :)

http://en.wikipedia.org/wiki/Pirate_flag#Jolly_Roger_gallery






RE: Flag tags

2012-05-31 Thread Shawn Steele
 We are missing the JOLLY ROGER.
 
 At least one, there're lots :)
 
 http://en.wikipedia.org/wiki/Pirate_flag#Jolly_Roger_gallery

 Ah, glyph variants. 

Ar, you're right, missed that :)

-Shawn






RE: Flag tags

2012-05-31 Thread Shawn Steele
Which ones are used in print?  Isn't that the criterion?  Personally, I'd like 
to see the maritime flags encoded, because I've always been interested in them, 
but I can see a case for them not being encoded.  (Though a couple of weeks ago 
on a cruise ship I did see them used in several places in print, as it were, 
though I'd have to concede that the reason they were in print was primarily 
decorative, though they were readable.  E.g.: the Signals bar spelled out in flags.)

Seems like swimming flags or shark flags or dive flags wouldn't be used much in 
print?

-Shawn

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Asmus Freytag
Sent: Thursday, May 31, 2012 9:00 AM
To: verd...@wanadoo.fr
Cc: Michael Everson; unicode Unicode Discussion
Subject: Re: Flag tags

On 5/31/2012 2:06 AM, Philippe Verdy wrote:
 2012/5/31 Asmus Freytag asm...@ix.netcom.com:
 On 5/30/2012 7:19 PM, Philippe Verdy wrote:
 2012/5/31 Michael Everson ever...@evertype.com:
 On 31 May 2012, at 00:24, Mark Davis ☕ wrote:
 Members of ISO National Bodies quite properly thought that it is 
 inappropriate for an International Standard to encode the flags of 
 some countries and not the flags of others. You can stuff your 
 condescension, Mark.
 I fully agree. Either all of them or none of them (or just a generic 
 white flag).
 No, at least the black pirate flag, and the checkered flag (for car racing).
 There are two black pirate flags. One is all black (the most generic 
 one), another has a skull and bones. OK, these are generic 
 enough to not convey country/territory-specific information.

 There are also conventional sky blue flags used in Europe (maybe
 elsewhere) for the quality of waters. There may be others used for 
 signaling (including surveillance of beaches and dangers for swimming: 
 red, orange, green); these may be unified with the all-black flag (if 
 color is not really encoded but assignable by external styles).

 If you add the flag for car racing, then why wouldn't there be flags used 
 in other transportation areas?
You are right! I missed these:

 Add also flags used as maritime alphabets (they are a true script by 
 themselves, whose mapping to actual letters depends on the locale's 
 script, so they are not really a visual variant of any script, just 
 like the Braille script is not tied to Latin), or other ideographic
 flags displayed much like the pirate flag (e.g. signaling diseases on 
 board)...










RE: Flag tags

2012-05-31 Thread Shawn Steele
 First, reprinting Shakespeare's works using flags would make it immediately 
 and utterly illegible to most speakers of English. So they would fail the test 
 of being recognizably the same letter.



FWIW: The Alpha flag doesn't mean A.  For example, it also means Diver 
Down.  Most of the flags have other meanings beyond just a letter, like Quebec = 
Quarantine.  So it's not just a substitution cipher.  Combinations can also 
have special meanings.  Additionally, repeaters make it more complicated than a 
simple substitution cipher, e.g.: November, Oscar, Repeat2, Repeat1 for noon == 
4 different flags for 2 letters.


[Image: ICS November] http://en.wikipedia.org/wiki/File:ICS_November.svg
[Image: ICS Oscar] http://en.wikipedia.org/wiki/File:ICS_Oscar.svg
[Image: ICS Repeat Two] http://en.wikipedia.org/wiki/File:ICS_Repeat_Two.svg
[Image: ICS Repeat One] http://en.wikipedia.org/wiki/File:ICS_Repeat_One.svg
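
A toy Python sketch (my own simplification of the substitute-pennant rule, with 
the alphabet trimmed to just the letters needed here) of why this is more than a 
letter-for-letter cipher:

    PHONETIC = {'N': 'November', 'O': 'Oscar'}   # trimmed to the letters used here

    def hoist(word):
        # A repeated letter is not flown twice; a substitute ("repeat") pennant
        # points back at the position of the earlier flag in the hoist.
        flags, seen = [], []
        for ch in word.upper():
            if ch in seen:
                flags.append(f'Repeat{seen.index(ch) + 1}')
            else:
                flags.append(PHONETIC[ch])
            seen.append(ch)
        return flags

    print(hoist('noon'))   # ['November', 'Oscar', 'Repeat2', 'Repeat1']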






-Shawn



RE: Encoding Standard (mostly complete)

2012-04-20 Thread Shawn Steele
 I think having a single specification to address all encoding questions is 
 useful. 
 It presents encoding algorithms in a consistent style and gives other 
 specifications a simple reference.

Unfortunately this document doesn't own any of the other standards that it's 
summarizing.  As a software developer I'd rather go to the source than to an 
intermediary document that may have inadvertently introduced a discrepancy from 
the actual authoritative standards.

I'm not suggesting that a single source wouldn't be nice, but rather that it's 
impractical.  Unless you can get Unicode to cede the definition of UTF-8 to 
your document, and the same with all the other standards, they're bound to be 
inconsistent or diverge.

It may be better as a pointer to the other standards, and/or documentation of 
quirks where other standards have been implemented differently and the pros & 
cons of those standards.

-Shawn





Klingon on Unicode site?

2012-04-03 Thread Shawn Steele
I was amused to see Klingon on the 
http://www.unicode.org/versions/Unicode6.1.0/ page ;-)

Yes, I realize it’s primarily me and maybe a few other geeks, but I still 
smiled.

[inline screenshot: the page footer with the date displayed in Klingon]

- Shawn

 
http://blogs.msdn.com/shawnste


RE: Klingon on Unicode site?

2012-04-03 Thread Shawn Steele
April 3rd, missed by a few days ☺

My assumption is that the page uses JS to get the dates?  Since my user locale 
happened to be set to Klingon, that’s what it displayed.  But it was not the 
first place I expected to see Klingon.

-Shawn

From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy
Sent: Tuesday, April 3, 2012 9:40 AM
To: Shawn Steele
Cc: unicode@unicode.org
Subject: Re: Klingon on Unicode site?

When was that published on the Unicode website? On April 1st?
On 3 April 2012 at 18:03, Shawn Steele shawn.ste...@microsoft.com wrote:
I was amused to see Klingon on the 
http://www.unicode.org/versions/Unicode6.1.0/ page ;-)

Yes, I realize it’s primarily me and maybe a few other geeks, but I still 
smiled.


- Shawn

 
http://blogs.msdn.com/shawnste



RE: Klingon on Unicode site?

2012-04-03 Thread Shawn Steele
 When the document is in English, it doesn't make sense to display the footer 
 date in the system locale.
 The locale used for this function should either be that of the site, or that of 
 the page.

After all we wouldn’t want a Unicode page to appear like it got contaminated 
with Klingon ;-)

-Shawn


RE: Code2000 on SourceForge (was Re: [indic] Re: Lack of Complex script rendering support on Android)

2012-02-03 Thread Shawn Steele
GPL != "do what you want with them" :)  For example, what Christoph pointed out. 
You may want to consider a more permissive license if "do what you want" is 
your intent.

-Shawn
(as myself)
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of James Kass
Sent: Friday, February 03, 2012 7:17 AM
To: unicode@unicode.org
Cc: l...@dashjr.org
Subject: Re: Code2000 on SourceForge (was Re: [indic] Re: Lack of Complex 
script rendering support on Android)


License already included in SourceForge download, namely GPLv3.



James Kass

--- On Fri, 2/3/12, Luke-Jr l...@dashjr.org wrote:

From: Luke-Jr l...@dashjr.org
Subject: Re: Code2000 on SourceForge (was Re: [indic] Re: Lack of Complex 
script rendering support on Android)
To: unicode@unicode.org
Cc: James Kass jamesk...@att.net
Date: Friday, February 3, 2012, 3:09 PM
On Friday, February 03, 2012 9:52:26 AM James Kass wrote:
 All fonts are now of course freeware - simply do what you want with them
 all.

Freeware isn't, AFAIK, a legal term.
Could you slap some kind of license on them?
The CC0 or MIT licenses sound like what you might want:
http://creativecommons.org/choose/zero/
http://www.opensource.org/licenses/mit-license.php




RE: Wrong UTF-8 encoders still around?

2011-10-20 Thread Shawn Steele
Define "still around" :)  Old software never dies... it just hangs around to 
make compatibility problems for a new generation.

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Martin J. Dürst
Sent: Thursday, October 20, 2011 4:00 PM
To: Unicode Mailing List
Cc: Larry Masinter
Subject: Wrong UTF-8 encoders still around?

I'm hoping to get some advice from people with experience with various 
Unicode/transcoding libraries.

RFC 3987 (the current IRI spec) has the following text:

Note: Some older software transcoding to UTF-8 may produce illegal
   output for some input, in particular for characters outside the
   BMP (Basic Multilingual Plane).  As an example, for the IRI with
   non-BMP characters (in XML Notation):
   "http://example.com/&#x10300;&#x10301;&#x10302;"
   which contains the first three letters of the Old Italic alphabet,
   the correct conversion to a URI is
   "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"

We are thinking about removing this because we hope that software has improved 
in the meantime, but we would like to be sure about this.
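
For anyone checking their own stack: a tiny Python sketch (mine, not Martin's) of 
the conversion the quoted note describes. A broken transcoder typically 
UTF-8-encodes each UTF-16 surrogate separately (CESU-8-style, e.g. 
%ED%A0%80%ED%BC%80 for the first letter), which is not legal UTF-8.

    from urllib.parse import quote

    # Correct conversion: percent-encode the UTF-8 bytes of U+10300..U+10302.
    iri_path = '\U00010300\U00010301\U00010302'
    print('http://example.com/' + quote(iri_path))
    # http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82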

If anybody knows about software out there that still presents this problem, 
please tell us.

Thanks, Martin.





