Re: Proposal: Duration

2019-03-04 Thread Mark Davis ☕️
Sadly, time is not that simple. Most people using calendars consider the
duration between January 15 and March 15 to be exactly 2 months. But such
intervals span different numbers of days, and hence of milliseconds,
depending on the months and year involved.
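A quick worked example of the distinction, using plain Date arithmetic in UTC:

const MS_PER_DAY = 24 * 60 * 60 * 1000;
(Date.UTC(2019, 2, 15) - Date.UTC(2019, 0, 15)) / MS_PER_DAY;  // 59 days ("2 months") in 2019
(Date.UTC(2020, 2, 15) - Date.UTC(2020, 0, 15)) / MS_PER_DAY;  // 60 days in 2020, a leap year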

Mark


On Mon, Mar 4, 2019 at 11:21 AM Naveen Chawla  wrote:

> I don't like it. Duration is just milliseconds for me.
>
> On Mon, 4 Mar 2019 at 18:47 Alexandre Morgaut 
> wrote:
>
>> Here a proposal to make ECMAScript natively support a Duration Object
>>
>> I talked about it a long time ago (2011) on the WHATWG mailing list in
>> the context of the Timers API:
>> https://lists.w3.org/Archives/Public/public-whatwg-archive/2011Feb/0533.htm
>>
>> I think that such a proposal would better take place in the core of the
>> language, and having worked on a framework's date-time APIs, I tried to give
>> this approach a closer look.
>>
>> ECMAScript has natively supported Dates since its very first version.
>> It started to support the ISO 8601 string format in edition 5
>> (15.9.1.15 Date Time String Format).
>>
>> Durations, like Dates, can be very tricky, especially with I18n in mind,
>> but the ECMA standard already had to handle most of the tricky parts of
>> Durations for the Date Object in ECMA 262 & ECMA 402.
>>
>> Duration, sometimes called TimeInterval, is a common concept supported by
>> most languages or associated standard libs.
>>
>> In short, the Duration object would:
>> - support the ISO syntax in its constructor: new Duration('P6W') // for
>> Period 6 Weeks
>> - allow handling Date diff operations
>> - allow being interpreted by setTimeout() & setInterval()
>>
>> Please find below a draft exposing the concept
>> I'd be very happy if someone from TC39 would be interested to champion it
>> https://github.com/AMorgaut/proposal-Duration
>>
>> Regards,
>>
>> Alexandre.
>> ___
>> es-discuss mailing list
>> es-discuss@mozilla.org
>> https://mail.mozilla.org/listinfo/es-discuss
>>
> ___
> es-discuss mailing list
> es-discuss@mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: add reverse() method to strings

2018-03-18 Thread Mark Davis ☕️
.reverse would only be reasonable for a subset of the characters supported by
Unicode. Its primary cited use case is a particular educational example,
and there are probably thousands of similar educational snippets that would
rarely be used in a production environment. Given that, it would be far better
for the people who really need it to simply provide a helper function to
their students for the sake of that example.
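For illustration, a minimal sketch of such a helper, using only standard
ES2015; spreading the string iterates by code points, so surrogate pairs
survive the reversal (combining marks would still come out reordered):

function reverseString(s) {
  // [...s] splits the string by code points, not UTF-16 code units
  return [...s].reverse().join('');
}
reverseString('abc\u{1F600}');  // "😀cba"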

Mark

On Sun, Mar 18, 2018 at 8:56 AM, Grigory Hatsevich 
wrote:

> "This would remove the challenge and actively worsen their learning
> process" -- this is not true. You can see it e.g. by looking at the
> specific task I was talking about:
>
> "Given a string, find the shortest possible string which can be achieved
> by adding characters to the end of initial string to make it a palindrome."
>
> This is my code for this task:
>
> function buildPalindrome(s){
>   String.prototype.reverse = function(){
>     return this.split('').reverse().join('')
>   }
>
>   function isPalindrome(s){
>     return s === s.reverse()
>   }
>
>   for (var i = 0; i < s.length; i++){
>     var first = s.slice(0, i);
>     var rest = s.slice(i);
>     if (isPalindrome(rest)){
>       return s + first.reverse()
>     }
>   }
> }
>
>
> As you see, the essence of this challenge is not in the process of
> reversing a string. Having a reverse() method just makes the code more
> readable -- compared to the alternative where one would have to write
> .split('').reverse().join('') each time instead of just .reverse()
>
> On Sun, Mar 18, 2018 at 2:38 PM, Frederick Stark 
> wrote:
>
>> The point of a coding task for a beginner is to practice their problem
>> solving skills to solve the task. This would remove the challenge and
>> actively worsen their learning process
>>
>>
>> On Mar 18 2018, at 6:26 pm, Grigory Hatsevich 
>> wrote:
>>
>>
>> My use case is solving coding tasks about palindromes on codefights.com.
>> Not sure if that counts as "real-world", but probably a lot of beginning
>> developers encounter such tasks at least once.
>>
>>
>>
>>
>> On Sun, 18 Mar 2018 06:41:46 +0700, Mathias Bynens 
>> wrote:
>>
>> So far no one has provided a real-world use case.
>>
>> On Mar 18, 2018 10:15, "Mike Samuel" wrote:
>>
>> Previous discussion:
>> https://esdiscuss.org/topic/wiki-updates-for-string-number-and-math-libraries#content-1
>>
>> """
>> String.prototype.reverse(), as proposed, corrupts supplementary
>> characters. Clause 6 of Ecma-262 redefines the word "character" as "a
>> 16-bit unsigned value used to represent a single 16-bit unit of text", that
>> is, a UTF-16 code unit. In contrast, the phrase "Unicode character" is used
>> for Unicode code points. For reverse(), this means that the proposed spec
>> will reverse the sequence of the two UTF-16 code units representing a
>> supplementary character, resulting in corruption. If this function is
>> really needed (is it? for what?), it should preserve the order of surrogate
>> pairs, as does java.lang.StringBuilder.reverse:
>> download.oracle.com/javase/7/docs/api/java/lang/StringBuilder.html#reverse()
>> """
>>
>> On Sat, Mar 17, 2018 at 1:41 PM, Grigory Hatsevich
>> wrote:
>>
>> Hi! I would propose to add a reverse() method to strings. Something
>> equivalent to the following:
>>
>> String.prototype.reverse = function(){
>>   return this.split('').reverse().join('')
>> }
>>
>> It seems natural to have such a method. Why not?
>>
>>
>>
>>
>> ___
>> es-discuss mailing list
>> es-discuss@mozilla.org
>> 
>> https://mail.mozilla.org/listinfo/es-discuss
>> 

Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Mark Davis ☕️
I think the cleanest mental model is where UTF-16 or UTF-8 strings are
interpreted as if they were transformed into UTF-32.

While that is generally feasible, it often represents a cost in performance
which is not acceptable in practice. So you see various approaches that
involve some deviation from that mental model.
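For illustration, the code-point view is what Array.from (or the string
iterator) gives in ES2015; a lone surrogate simply becomes its own element:

Array.from('foo\uD834bar').length;        // 7 – the lone lead surrogate counts as one code point
Array.from('foo\uD834\uDC00bar').length;  // 7 – a valid surrogate pair also counts as one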


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Wed, Jan 28, 2015 at 2:15 PM, Marja Hölttä ma...@chromium.org wrote:

 For a reference, here's how Java (tried w/ Oracle 1.8.0_31 and openjdk
 1.7.0_65) Pattern.UNICODE_CHARACTER_CLASS works:

 foo\uD834bar and foo\uDC00bar match ^foo[^a]bar$ and ^foo.bar$, so,
 generally, lonely surrogates match /./.

 Backreferences are allowed to consume the leading surrogate of a valid
 surrogate pair:

 Ex1: foo\uD834bar\uD834\uDC00 matches foo(.+)bar\1

 But surprisingly:

 Ex2: \uDC00foobar\uD834\uDC00foobar\uD834 doesn't match ^(.+)\1$

 ... So Ex2 works as if the input string was converted to UTF-32 before
 matching, but Ex1 works as if it was def not. Idk what's the correct mental
 model where both Ex1 and Ex2 would make sense.


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Mark Davis ☕️
Good, that sounds right.


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Wed, Jan 28, 2015 at 12:57 PM, André Bargull andre.barg...@udo.edu
wrote:

  On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä marja at chromium.org wrote:

 The ES6 unicode regexp spec is not very clear regarding what should happen
 if the regexp or the matched string contains lonely surrogates (a lead
 surrogate without a trail, or a trail without a lead). For example, for the
 . operator, the relevant parts of the spec speak about characters:
 ​Just a bit of terminology.

 The term character is overloaded, so Unicode provides the unambiguous
 term code point. For example, U+0378​ is not (currently) an encoded
 character according to Unicode, but it would certainly be a terrible idea
 to disregard it, or not match it. It is a reserved code point that may be
 assigned as an encoded character in the future. So both U+D83D and U+0378
 are not characters.

 If an ES spec uses the term character instead of code point, then at
 some point in the text it needs to disambiguate what is meant.


 character is defined in 21.2.2 Pattern Semantics [1]:

 In the context of describing the behaviour of a BMP pattern “character”
 means a single 16-bit Unicode BMP code point. In the context of describing
 the behaviour of a Unicode pattern “character” means a UTF-16 encoded code
 point.



 [1]
 https://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Q: Lonely surrogates and unicode regexps

2015-01-28 Thread Mark Davis ☕️
On Wed, Jan 28, 2015 at 11:36 AM, Marja Hölttä ma...@chromium.org wrote:

 The ES6 unicode regexp spec is not very clear regarding what should happen
 if the regexp or the matched string contains lonely surrogates (a lead
 surrogate without a trail, or a trail without a lead). For example, for the
 . operator, the relevant parts of the spec speak about characters:


​Just a bit of terminology.

The term character is overloaded, so Unicode provides the unambiguous
term code point. For example, U+0378​ is not (currently) an encoded
character according to Unicode, but it would certainly be a terrible idea
to disregard it, or not match it. It is a reserved code point that may be
assigned as an encoded character in the future. So both U+D83D and U+0378
are not characters.

If an ES spec uses the term character instead of code point, then at
some point in the text it needs to disambiguate what is meant.
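A small illustration of the distinction, using the codePointAt method that
ES2015 added (each value below is a code point, though not all of them are
assigned characters):

'\u0378'.codePointAt(0).toString(16);        // "378"  – reserved, unassigned code point
'\uD83D'.codePointAt(0).toString(16);        // "d83d" – a lone surrogate is still a code point
'\uD83D\uDE00'.codePointAt(0).toString(16);  // "1f600" – a surrogate pair decodes to one code point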

As to how this should be handled in regular expressions, I'd suggest looking
at Java's approach.

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: [Json] JSON: remove gap between Ecma-404 and IETF draft

2013-11-13 Thread Mark Davis
On Wed, Nov 13, 2013 at 3:51 PM, Joe Hildebrand (jhildebr) 
jhild...@cisco.com wrote:

 that all software implementations
 which receive the un-prefixed text will not generate parse errors.


perhaps:

...​all conformant software ...​



Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Internationalization: Support for IANA time zones

2013-03-02 Thread Mark Davis
There are two different issues:

Zone - what we do in CLDR solves the issue. All implementations should, and
as far as I know, do, accept all of the valid TZIDs. Because we use
existing valid TZIDs as the canonical form, even though they are not the
same as in the Zone file, it all works. And we in ECMAScript can reference
the CLDR IDs without requiring the use of any of the localization data;
they are maintained in the timezone.xml file, e.g.:

http://unicode.org/repos/cldr/tags/release-22-1/common/bcp47/timezone.xml

We have also developed and maintain the short unique IDs that you see
there, so that the Olson TZIDs can be used in locale tags.


Disappearing IDs. Yes, the best that can be done for that is to keep any
old ones despite what the IANA registry does, and deprecate them. That's
also what we do in BCP47 with disappearing ISO codes (ugg). The downside
here is that a different implementation of the IANA timezone APIs may drop
support for the old ones. So we have to decide whether that is a
requirement of implementations or not.


Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*


On Fri, Mar 1, 2013 at 9:40 PM, Norbert Lindenberg 
ecmascr...@lindenbergsoftware.com wrote:

 The identifier issues first:

 On Mar 1, 2013, at 7:40 , Mark Davis ☕ wrote:

   These names are canonicalized to the corresponding Zone name in the
 casing used
 
  Because the Zone names are unstable, in CLDR we adopted the same
 convention as in BCP47. That is, our canonical form never changes, no
 matter what happens to Zone names. I'd strongly recommend using those as
 the canonical names to prevent instabilities.

 The lack of stability in these identifiers is a problem. I don't see
 however how creating our own set (or using a set defined by CLDR) would
 help with interoperability with other systems based on IANA time zone
 names. It reminds me of how the Java Locale class was specified in 1997 to
 use language codes for three languages that ISO had deprecated years
 earlier, forcing Java developers to deal with an incompatibility between
 the Java world (including down to the file names of resource bundles) and
 the rest of the world.

 Has anybody tried to correct this at the source, by writing an RFC with
 maintenance rules for the names in the time zone database, similar to
 what's in BCP 47? How did the folks on the tz mailing list react?

   reject strings that are not registered
 
  That is another problem, since the TZ database does not guarantee
 non-removal (and has removed IDs).

 This one could be corrected by allowing as input all names that were in
 the time zone database at any time after a given date, e.g. after IANA took
 over, and treating them as Link names.

 Norbert
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Internationalization: Support for IANA time zones

2013-03-02 Thread Mark Davis
 It seems we have agreement that the canonicalized IANA names are not good
for formatted strings. I like the CLDR solution, but see it as
implementation dependent. *Maybe there's just no value in trying to define
something in the standard since any implementer can claim that Center,
North Dakota and GMT+09:00 are localized representations for some locale.*
So, leave it all implementation dependent?

​I agree.​ (And you hit on an important point above.)

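For a concrete sense of what "implementation dependent" looks like in
practice, here is a sketch using the ECMA-402 API as it eventually shipped
(output varies by engine and its CLDR data; where no localized name exists,
engines typically fall back to a GMT offset or the raw ID):

new Intl.DateTimeFormat('en', {
  timeZone: 'America/Indiana/Tell_City',
  hour: 'numeric',
  timeZoneName: 'long'
}).format(Date.now());
// e.g. "3 PM Central Daylight Time" in one implementation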


Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*


On Fri, Mar 1, 2013 at 10:33 PM, Norbert Lindenberg 
ecmascr...@lindenbergsoftware.com wrote:

 And the time zone names in formatted output when no localized time zone
 name is available:

 On Feb 28, 2013, at 15:35 , Norbert Lindenberg wrote:

  5) The set of combinations of time zone name and language tag for which
 localized time zone names are available is implementation dependent. Where
 no localized time zone name is available, the canonicalized name is used in
 formatted output.
 
  The last one I'm not entirely comfortable with: IANA time zone names can
 be long and unfamiliar (e.g., America/Indiana/Tell_City), and sometimes
 people think the wrong representative city was selected (e.g., Shanghai
 rather than Beijing for China). An alternative might be to prescribe
 formatting as an offset from UTC.


 On Feb 28, 2013, at 16:13 , Shawn Steele wrote:

  For #5 I might prefer falling back to English or something.  I don't
 think UTC offset is a good idea because that doesn't really represent a
 Timezone very well.  (If a meeting gets moved to a following week, that
 offset might change or be wrong)

 On Mar 1, 2013, at 7:40 , Mark Davis ☕ wrote:

  This is problematic. The canonicalized names are very ugly. What we do
 in CLDR is return the last label, after some modifications (in
 http://www.unicode.org/repos/cldr/trunk/common/main/root.xml). We don't
 want to return the raw IDs. I think this needs to be implementation
 dependent.
 
  For example:
 
  <zone type="Antarctica/DumontDUrville">
      <exemplarCity>Dumont d’Urville</exemplarCity>
  </zone>
  <zone type="America/North_Dakota/Center">
      <exemplarCity>Center, North Dakota</exemplarCity>
  </zone>
 
  So I think we should just have #5 be:
 
  5) The set of combinations of time zone name and language tag for which
 localized time zone names are available is implementation dependent.

 On Mar 1, 2013, at 9:41 , Phillips, Addison wrote:

  I think the least surprise would result if the GMT+/- string were used
 when no local representation is available. While the actual time zone is
 more specific, most callers are just trying to put a date or time value
 into their output for human consumption. In most cases, the DST transition
 rules are unimportant to a specific date value being rendered and the GMT
 offset is at least somewhat compact. Users are probably more familiar with
 this presentation and certainly will be happier with it than
 America/Los_Angeles.

 It seems we have agreement that the canonicalized IANA names are not good
 for formatted strings. I like the CLDR solution, but see it as
 implementation dependent. Maybe there's just no value in trying to define
 something in the standard since any implementer can claim that Center,
 North Dakota and GMT+09:00 are localized representations for some
 locale. So, leave it all implementation dependent?

 Norbert

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Internationalization: Support for IANA time zones

2013-03-02 Thread Mark Davis
On Sat, Mar 2, 2013 at 5:11 PM, Shawn Steele shawn.ste...@microsoft.com wrote:

 I’m uncomfortable using the CLDR names, although perhaps they could be
 aliases, because other standards use the tzdb names and we have to be able
 to look up the tzdb names.  It might be nice to get more stability for the
 tzdb names, like aliases or something.


​I'm simply not explaining myself correctly. *The CLDR IDs (long IDs) *ARE*
tzdb IDs.* Let me give you a specific case.

The TZDB has the equivalence class {Asia/Calcutta Asia/Kolkata}. They used
to have the former as the canonical name (in Zone), but then changed it to
the latter. Here is the current TZDB data:

zone.tab

IN +2232+08822 Asia/Kolkata


asia

Zone Asia/Kolkata 5:53:28 - LMT 1880 # Kolkata
...


backward

Link Asia/Kolkata Asia/Calcutta


Because of the Link, both are valid and equivalent.


CLDR, because we need stability, retains the *former* TZID as the canonical
name. That is the meaning of the first alias in
http://unicode.org/repos/cldr/tags/release-22-1/common/bcp47/timezone.xml such
as in:

<type name="inccu" alias="Asia/Calcutta Asia/Kolkata" description="Kolkata, India"/>

The short name (name=...) is only used for BCP47 subtags (because of the
ASCII/8-char limit), *not* for communicating with TZDB implementations. Nor
is it used in CLDR for the canonical ID.

Instead, Asia/Calcutta is used as the canonical ID. Here are some samples
of the localizations.

<zone type="Asia/Calcutta">
    <exemplarCity>ኮልካታ</exemplarCity>
</zone>
...
<zone type="Asia/Calcutta">
    <exemplarCity>Kolkata</exemplarCity>
</zone>
...
<zone type="Asia/Calcutta">
    <exemplarCity>コルカタ</exemplarCity>
</zone>

and so on. Note that these are all TZIDs.
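For illustration, a hypothetical sketch of this stable canonicalization
(the object literal below is made up for the example; a real implementation
would read CLDR's alias data):

const cldrCanonical = {
  'Asia/Kolkata':  'Asia/Calcutta',  // tzdb renamed the Zone, CLDR keeps the older ID
  'Asia/Calcutta': 'Asia/Calcutta'
};
function canonicalizeTzid(tzid) {
  return cldrCanonical[tzid] || 'Etc/Unknown';  // Etc/Unknown marks an unrecognized TZID
}
canonicalizeTzid('Asia/Kolkata');  // "Asia/Calcutta"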

The only long ID that CLDR adds is one to indicate an unknown/illegal TZID:

<zone type="Etc/Unknown">
    <exemplarCity>地域不明</exemplarCity>
</zone>


Is the picture clearer now?


Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Internationalization: Support for IANA time zones

2013-03-01 Thread Mark Davis
 These names are canonicalized to the corresponding Zone name in the
casing used

Because the Zone names are unstable, in CLDR we adopted the same convention
as in BCP47. That is, our canonical form never changes, no matter what
happens to Zone names. I'd strongly recommend using those as the canonical
names to prevent instabilities.

 reject strings that are not registered

That is another problem, since the TZ database does not guarantee
non-removal (and *has* removed IDs).

 Where no localized time zone name is available, the canonicalized name is
used in formatted output.
  An alternative might be to prescribe formatting as an offset from UTC.

This is problematic. The canonicalized names are very ugly. What we do in
CLDR is return the last label, after some modifications (in
http://www.unicode.org/repos/cldr/trunk/common/main/root.xml). We don't
want to return the raw IDs. I think this needs to be implementation
dependent.

For example:

<zone type="Antarctica/DumontDUrville">
    <exemplarCity>Dumont d’Urville</exemplarCity>
</zone>
<zone type="America/North_Dakota/Center">
    <exemplarCity>Center, North Dakota</exemplarCity>
</zone>
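A rough sketch of the "last label" fallback described above (the real CLDR
rules apply additional modifications, as the data in root.xml shows):

function exemplarCityFallback(tzid) {
  return tzid.split('/').pop().replace(/_/g, ' ');
}
exemplarCityFallback('America/North_Dakota/Center');  // "Center"
exemplarCityFallback('America/Indiana/Tell_City');    // "Tell City"
exemplarCityFallback('Antarctica/DumontDUrville');    // "DumontDUrville" – still needs CLDR's extra fix-ups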

So I think we should just have #5 be:

5) The set of combinations of time zone name and language tag for which
localized time zone names are available is implementation dependent.


Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*


On Fri, Mar 1, 2013 at 3:04 PM, Andrew Paprocki and...@ishiboo.com wrote:

 Norbert, Are you planning on using the Unicode CLDR data? This data has
 the localized exemplar cities for every IANA timezone in every locale. For
 example, America/New_York in Russian:


 http://unicode.org/cldr/trac/browser/tags/release-22-1/common/main/ru.xml#L3906

 We currently use the IANA data for actual datetime computation, but the
  CLDR data is used for things such as display and conversion from Windows
  timezone identifiers to IANA timezone identifiers.

 -Andrew


 On Thu, Feb 28, 2013 at 7:13 PM, Shawn Steele 
 shawn.ste...@microsoft.com wrote:

 For #5 I might prefer falling back to English or something.  I don't
 think UTC offset is a good idea because that doesn't really represent a
 Timezone very well.  (If a meeting gets moved to a following week, that
 offset might change or be wrong)

 -Shawn

 -Original Message-
 From: es-discuss-boun...@mozilla.org [mailto:
 es-discuss-boun...@mozilla.org] On Behalf Of Norbert Lindenberg
 Sent: Thursday, February 28, 2013 3:36 PM
 To: es-discuss
 Subject: Internationalization: Support for IANA time zones

 I'm updating the ECMAScript Internationalization API spec to support the
 names of the IANA Time Zone Database [1] in DateTimeFormat. I'd like to
 highlight a few key points of my draft to see whether there are comments:

 1) The supported names are the Link and Zone names of the IANA Time Zone
 Database. Names are matched ASCII-case-insensitively.

 2) These names are canonicalized to the corresponding Zone name in the
 casing used in the IANA Time Zone Database. Etc/GMT and Etc/UTC are
 canonicalized to UTC. (The subtle differences between GMT and UTC probably
 don't matter to users of this API.)

 3) Implementations must recognize all registered Zone and Link names,
 reject strings that are not registered, and use best available current and
 historical information about their offsets from UTC and their daylight
 saving time rules in calculations. (This is different from language tags
 and currency codes, where we accept strings that fit a general pattern
 without requiring reference to the actual registry. The IANA Time Zone
 Database doesn't specify a general pattern for time zone names, and
 accepting a string for which UTC offset and DST rules aren't known can only
 lead to errors.)

 4) If no time zone name is provided to the DateTimeFormat constructor,
 DateTimeFormat.prototype.resolvedOptions returns the canonicalized Zone
 name of the host environment's current time zone. (This potentially
 incompatible change was pre-announced in a note in section 12.3.3.)

 5) The set of combinations of time zone name and language tag for which
 localized time zone names are available is implementation dependent. Where
 no localized time zone name is available, the canonicalized name is used in
 formatted output.

 The last one I'm not entirely comfortable with: IANA time zone names can
 be long and unfamiliar (e.g., America/Indiana/Tell_City), and sometimes
 people think the wrong representative city was selected (e.g., Shanghai
 rather than Beijing for China). An alternative might be to prescribe
 formatting as an offset from UTC.

 Comments?

 Norbert

 [1] http://www.iana.org/time-zones/
 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss



 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss



 

Re: Flexible String Representation - full Unicode for ES6?

2012-12-21 Thread Mark Davis
The main complication for compatibility is indexing.

See
http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html
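A quick illustration of the indexing problem with a non-BMP character
(standard ES2015 only; nothing implementation specific):

const s = 'a\u{1F600}b';   // "a😀b"
s.length;                  // 4 – the emoji occupies two UTF-16 code units
s[1];                      // "\uD83D" – a lone lead surrogate, not a usable character
Array.from(s).length;      // 3 – counting by code points, the UTF-32-like view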

If you look back about a year in this list's archive you'll find a long
discussion.

{phone}
On Dec 21, 2012 9:34 PM, Chris Angelico ros...@gmail.com wrote:

 On Sat, Dec 22, 2012 at 4:09 PM, Erik Arvidsson
 erik.arvids...@gmail.com wrote:
  On Fri, Dec 21, 2012 at 6:45 PM, Chris Angelico ros...@gmail.com
 wrote:
 
  There is an alternative. Python (as of version 3.3) has implemented a
  new Flexible String Representation, aka PEP-393; the same has existed
  in Pike for some time. A string is stored in memory with a fixed
  number of bytes per character, based on the highest codepoint in that
  string - if there are any non-BMP characters, 4 bytes; if any
   U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings
  being immutable (otherwise there'd be an annoying string-copy
  operation when a too-large character gets put in), which is true of
  ECMAScript. Effectively, all strings are stored in UCS-4/UTF-32, but
  with the leading 0 bytes elided when they're not needed.
 
  This is how most VMs already work.
 
  I agree with you that it would be a better world if this was the case
  but I don't hear you suggesting how we might be able to change this
  without breaking the web?

 Why, if that's how it's already being done, can't there be an easy way
 to expose it to the script that way? Just flip the Big Red Switch and
 suddenly be fully Unicode-safe? Yes, it's backward-incompatible, but
 if the script can have some kind of marker (like use strict) to show
 that it's compliant, or if the engine can simply be told be
 compliant, we could begin to move forward. Otherwise, we're stuck
 where we are.

 Chris Angelico
 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: API for text editing

2012-10-18 Thread Mark Davis
I don't see how the W3C could supply an API that would be accessible from
JavaScript, or am I misunderstanding?

Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*



On Thu, Oct 18, 2012 at 2:55 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 Hi Mark,

 API to support text editing applications is an important topic, but I'm
 afraid it's beyond the scope that TC 39 generally defines for itself. TC 39
 primarily defines the ECMAScript language, and then adds some core API
 that's required in all environments where the language might be used. API
 related to fonts would likely come well after I/O, which itself isn't on
 any roadmap I've seen.

 A better venue for this proposal might be the W3C, and Boris Zbarsky
 pointed me at some related work that's already going on there:

  - finding the width of a string
 
  http://dev.w3.org/csswg/cssom-view/#extensions-to-the-range-interface is
  probably the closest to it...  There is also
 http://www.whatwg.org/specs/web-apps/current-work/multipage/the-canvas-element.html#dom-context-2d-measuretext
 
  It might be worth exposing a text-measurement API independent of canvas,
 of course.
 
  - determine cursor position (with bidi support)
 
  http://dvcs.w3.org/hg/editing/raw-file/tip/editing.html#selections
 
  - detecting whether a glyph is available for a given character
 
  I don't believe there is anything for that right now.
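 As a concrete example of the first item (finding the width of a string),
 a minimal sketch with the canvas 2D API mentioned above (browser
 environment assumed):

 const ctx = document.createElement('canvas').getContext('2d');
 ctx.font = '16px serif';
 const width = ctx.measureText('שלום world').width;  // advance width in CSS pixels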


 It might be a good idea to define more clearly what's needed and then
 approach the relevant W3C working group(s), possibly through the
 Internationalization working group.

 Norbert


 On Oct 15, 2012, at 18:14 , Mark Davis ☕ wrote:

  I added the following for discussion:
 
  https://bugs.ecmascript.org/show_bug.cgi?id=798
  https://bugs.ecmascript.org/show_bug.cgi?id=797
 
  Mark
 
  — Il meglio è l’inimico del bene —
 
 
 
  On Mon, Oct 15, 2012 at 5:51 PM, Gillam, Richard gil...@lab126.com
 wrote:
  Hi everybody--
 
  Here are the minutes from the October 5 ES internationalization ad-hoc.
  Sorry it took me so long to get them out…
 
  --Rich Gillam
 
  ECMAScript internationalization meeting
 
  10/5/12, 10:20AM
 
 
 
  Richard Gillam (invited expert), Nebojša Ćirić (Google), Norbert
 Lindenberg (Mozilla), Eric Albright (Microsoft), Allen Wirfs-Brock
 (Mozilla), Jungshik Shin (Google)
 
 
 
  Timeline.  We began with a discussion of the timeline for the next
 version of the internationalization spec.  The first version took over two
 years, and it sounds like it’s impossible to get anything through the
 process in less than a year, so we settled on a year and a half: We think
 we can produce the second version somewhat more quickly than the first one
 because we’re more familiar with the process now, but we still need to
 leave time to get feedback.  We’ll target completion for June 2014, to
 present to TC39 in September or November.
 
 
 
  Prioritization.  We spent most of the meeting  going through the “wish
 lists” that were compiled before the meeting, briefly discussing each item,
 and assigning it an approximate priority.  We generally tried to give
 higher priority to things developers couldn’t easily write in ECMAScript
 itself.
 
 
 
  Text segmentation.  Most of the discussion here centered on whether this
 was even a necessary feature in the first place.  There are some people
 writing text editors in JavaScript, and there’s apparently a group doing a
 PDF renderer in JavaScript, but there was still some question of whether
 the functionality was common enough to include in all browsers, especially
 considering the data tables (especially for dictionary-based
 implementations such as Japanese word breaking) can be large.  On the other
 hand, browsers already have to have most of all of this data just to render
 HTML.  Google mentioned they already have a BreakIterator implementation.
  The general consensus was that this feature was medium priority.
 
 
 
  String transformations.  This includes Unicode normalization,
 language-sensitive case conversion, and possible case folding (i.e.,
 converting to a case-independent form of the string—this is generally
 equivalent to converting to upper case except for a few characters that get
 lost in upper case, such as ß).
 
 
 
  The general consensus here is that case conversion and normalization
 both needed to go in the main ECMAScript spec, not into the i18n spec.
  Norbert has a strawman for a normalization API (
 http://wiki.ecmascript.org/doku.php?id=strawman:unicode_normalization )
 that we should push with TC39, and we should simply tighten the definition
 of toLocaleUpperCase() and toLocaleLowerCase() to have them take a locale
 parameter.  Norbert has also put together a strawman for this:
 http://wiki.ecmascript.org/doku.php?id=strawman:case_conversion
 
 
 
  Getting this stuff into the main ES draft was considered high priority;
  we’d like to get it into ES6 if that’s possible

Re: Minutes from 10/5 internationalization ad-hoc meeting

2012-10-15 Thread Mark Davis
I added the following for discussion:

https://bugs.ecmascript.org/show_bug.cgi?id=798
https://bugs.ecmascript.org/show_bug.cgi?id=797

Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*



On Mon, Oct 15, 2012 at 5:51 PM, Gillam, Richard gil...@lab126.com wrote:

 Hi everybody--

 Here are the minutes from the October 5 ES internationalization ad-hoc.
  Sorry it took me so long to get them out…

 --Rich Gillam

 ECMAScript internationalization meeting

 10/5/12, 10:20AM


 Richard Gillam (invited expert), Nebojša Ćirić (Google), Norbert
 Lindenberg (Mozilla), Eric Albright (Microsoft), Allen Wirfs-Brock
 (Mozilla), Jungshik Shin (Google)


 *Timeline.*  We began with a discussion of the timeline for the next
 version of the internationalization spec.  The first version took over
 two years, and it sounds like it’s impossible to get anything through the
 process in less than a year, so we settled on a year and a half: We think
 we can produce the second version somewhat more quickly than the first one
 because we’re more familiar with the process now, but we still need to
 leave time to get feedback.  We’ll target completion for June 2014, to
 present to TC39 in September or November.


 *Prioritization.*  We spent most of the meeting  going through the “wish
 lists” that were compiled before the meeting, briefly discussing each item,
 and assigning it an approximate priority.  We generally tried to give
 higher priority to things developers couldn’t easily write in ECMAScript
 itself.


 *Text segmentation.*  Most of the discussion here centered on whether
 this was even a necessary feature in the first place.  There are some
 people writing text editors in JavaScript, and there’s apparently a group
 doing a PDF renderer in JavaScript, but there was still some question of
 whether the functionality was common enough to include in all browsers,
 especially considering the data tables (especially for dictionary-based
 implementations such as Japanese word breaking) can be large.  On the
 other hand, browsers already have to have most of all of this data just to
 render HTML.  Google mentioned they already have a BreakIterator
 implementation.  The general consensus was that this feature was medium
 priority.


 *String transformations.*  This includes Unicode normalization,
 language-sensitive case conversion, and possible case folding (i.e.,
 converting to a case-independent form of the string—this is generally
 equivalent to converting to upper case except for a few characters that get
 lost in upper case, such as ß).


 The general consensus here is that case conversion and normalization both
 needed to go in the main ECMAScript spec, not into the i18n spec.  Norbert
 has a strawman for a normalization API (
 http://wiki.ecmascript.org/doku.php?id=strawman:unicode_normalization )
 that we should push with TC39, and we should simply tighten the definition
 of toLocaleUpperCase() and toLocaleLowerCase() to have them take a locale
 parameter.  Norbert has also put together a strawman for this:
 http://wiki.ecmascript.org/doku.php?id=strawman:case_conversion


 Getting this stuff into the main ES draft was considered high priority;
  we’d like to get it into ES6 if that’s possible.
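 For reference, a small illustration of what eventually shipped in this
 area (String.prototype.normalize in ES2015, and a locale argument for
 toLocaleUpperCase/toLocaleLowerCase via ECMA-402):

 '\u0065\u0301'.normalize('NFC') === '\u00e9';  // true – "e" + combining acute composes to "é"
 'i'.toLocaleUpperCase('tr');                   // "İ" – Turkish dotted capital I, given Turkish locale data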


 There was no stomach for doing either folding or titlecase.  Eric and
 Norbert pointed out that Unicode titlecasing really doesn’t match any set
 of user expectations: rules for this vary widely and many publishers define
 their own house rules.


 *Character properties.*  The big question is whether we just want to
 surface some sort of Unicode-property-test idiom in the Regex API, or
 whether we need a separate, callable API just for doing Unicode property
 queries.  After a lot of discussion, the consensus was to just put this
 into the Regex API and not add any new functions, although we fear it’s too
 late to do that for ES6.  We might do the lower-level API as a fallback
 if this turns out to be true.  The consensus was that this is high
 priority in either case. Norbert was delegated to develop a more specific
 proposal.
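 For reference, the Regex direction this eventually took is Unicode
 property escapes with the u flag (ES2018), e.g.:

 /\p{Letter}/u.test('ß');         // true
 /\p{Script=Greek}/u.test('π');   // true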


 *Message formatting.*  The larger ES community seems to think this is
  being addressed with “template strings” (formerly “quasi-literals”),
 although this solution doesn’t provide a way to deal with plurals and
 gender (and no one but Allen really liked it).  We agreed this was high
 priority, and delegated Nebojsa to investigate more thoroughly and put
 together a strawman.


 *Time zones.*  We agreed to broaden the existing time-zone APIs to allow
 the full generality of time zones, not just UTC and the local time zone,
 and that we would use the IANA (formerly Olson) identifiers.  [This was
 made easier by the fact that IANA is now standardizing the Olson names.]  We
  agreed this change is high priority, and that it only involves minor tweaks
 to the language in the 

Re: Calendar issues

2012-09-13 Thread Mark Davis
In ICU, we are using Gregorian eras (AD/BC) as customarily interpreted, and
there is no year zero. There isn't a simple way to get non-era years—and
that form is mostly interesting to techies, not normal people, which is why
we support the era form.

(If someone wanted to do it, you could probably get reasonable results by
taking the input date, parsing with a calendar, and if the year < 1, set
the year field to 1 - year, get the date pattern for the locale, get the
number pattern for a negative integer in the locale, insert the
prefix/suffix around the year field in the date pattern, and format the
Calendar date. That'd be a dozen or two lines of code, but would need some
extra code for exceptions.)
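A minimal sketch of that era conversion in ECMAScript (assuming astronomical
year numbering on input, where year 0 is 1 BC):

function toEraYear(year) {
  return year >= 1 ? year + ' AD' : (1 - year) + ' BC';
}
toEraYear(2);   // "2 AD"
toEraYear(1);   // "1 AD"
toEraYear(0);   // "1 BC"
toEraYear(-1);  // "2 BC"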

Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*



On Thu, Sep 13, 2012 at 8:40 AM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:


 On Sep 13, 2012, at 6:55 , Andrew Paprocki wrote:

  and Explorer formats it as being in the year 1 BC. Safari calculates
 the day
  according to the Julian calendar, all others use the proleptic
 Gregorian
  calendar.
 
  That is very surprising to me. Can anyone comment on why Safari chose
  that implementation?

 Probably because that's the default used for date and time formatting in
 ICU. ICU can be made to use a proleptic calendar by setting the Gregorian
 cutover to the beginning of time; I don't see an easy way to make it
 introduce a year 0.

 Norbert


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Calendar issues

2012-09-13 Thread Mark Davis
I really was not very clear about what I think; sorry for rambling a bit.

Yes, I agree that the best result for Gregorian is to have correct era
support, which means there is no year zero: you have 2 AD, 1 AD, 1 BC, 2
BC,...

Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*



On Thu, Sep 13, 2012 at 1:38 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 The output of Date.prototype.toLocaleString and
 DateTimeFormat.prototype.format is also intended for normal people, not for
 techies. So why should we introduce a year 0 for them?

 Norbert


 On Sep 13, 2012, at 13:31 , Mark Davis ☕ wrote:

  In ICU, we are using Gregorian eras (AD/BC) as customarily interpreted,
 and there is no year zero. There isn't a simple way to get non-era
 years—and that form is mostly interesting to techies, not normal people,
 which is why we support the era form.
 
  (If someone wanted to do it, you could probably get reasonable results
 by taking the input date, parsing with a calendar, and if the year < 1, set
 the year field to 1 - year, get the date pattern for the locale, get the
 number pattern for a negative integer in the locale, insert the
 prefix/suffix around the year field in the date pattern, and format the
 Calendar date. That'd be a dozen or two lines of code, but would need some
 extra code for exceptions.)
 
  Mark
 
  — Il meglio è l’inimico del bene —
 
 
 
  On Thu, Sep 13, 2012 at 8:40 AM, Norbert Lindenberg 
 ecmascr...@norbertlindenberg.com wrote:
 
  On Sep 13, 2012, at 6:55 , Andrew Paprocki wrote:
 
   and Explorer formats it as being in the year 1 BC. Safari
 calculates the day
   according to the Julian calendar, all others use the proleptic
 Gregorian
   calendar.
  
   That is very surprising to me. Can anyone comment on why Safari chose
   that implementation?
 
  Probably because that's the default used for date and time formatting in
 ICU. ICU can be made to use a proleptic calendar by setting the Gregorian
 cutover to the beginning of time; I don't see an easy way to make it
 introduce a year 0.
 
  Norbert
 
 


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Calendar issues

2012-09-12 Thread Mark Davis
Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*



On Wed, Sep 12, 2012 at 8:43 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 ES5 section 15.9.1 specifies a number of operations to map time values
 (measured in milliseconds from January 1, 1970, midnight UTC) to
 year/month/day/hour/minute/second values, and the ECMAScript
 Internationalization API specification section 12.3.2 mandates use of these
 algorithms also for formatting localized date and time strings if the
 Gregorian calendar is used.

 A well-placed test within the SpiderMonkey test suite reminded me of two
 issues related to that:

 - The algorithms use a proleptic Gregorian calendar, that is, apply the
 rules of the Gregorian calendar all the way back to the beginning of
 ECMAScript time. Normal usage, however, is to use the Julian calendar for
 dates before the introduction of the Gregorian calendar in 1582 (and in
 some countries for quite some time after that).



The problem is that the date when countries shifted to using Gregorian as
the primary calendar varies wildly. And it wasn't just from the Julian
calendar; in non-Christian countries, it was from many others (often still
in use as a secondary calendar). And historically, there were transitions
between those calendars. And not everyone in the same country followed the
same calendar, or switched at the same time. And even on the Julian
calendar, before 525 AD there wasn't the practice of using AD; you'd have
to have the year set to the year of the emperor (and let's not talk about
the transitions there...)

Anyway, it typically isn't worth the trouble. It is quite customary to use
a proleptic calendar; many if not most standards do it.


 - The year calculation assumes that there was a year 0, while in normal
 usage the year before 1 AD is 1 BC.


If the implementation supports eras, then you would have 2 AD, 1 AD, 1 BC,
2 BC...

If it uses negative proleptic years, then you'd have 2 AD, 1 AD, 0 AD, -1
AD, -2 AD, ...

I don't know that we want to have such a difference be a gating item


 With regards to the first issue, the November 2011 draft of the spec had
 limited applicability of the 15.9.1 algorithms to the internationalization
 API to dates after 1930, but then somebody (I forgot who) convinced me to
 remove that limitation. I don't think the second issue has ever been
 discussed, and introducing a year 0 where there was none just seems wrong
 to me.

 Current implementations of Date.prototype.toLocaleString are split:
 Firefox, Chrome, and Opera format a date in 1 BC as being in the year 0,
 while Safari formats it as being in the year 1 (means BC but doesn't say
 so) and Explorer formats it as being in the year 1 BC. Safari calculates
 the day according to the Julian calendar, all others use the proleptic
 Gregorian calendar.

 Thoughts?

 Depending on what we decide, the beginning of ECMAScript time could be
 anywhere between 271816 BC and 271822 BC...

 Regards,
 Norbert

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Calendar issues

2012-09-12 Thread Mark Davis
+Peter, since he has an interest in these issues.

Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*



On Wed, Sep 12, 2012 at 9:37 PM, Mark Davis ☕ m...@macchiato.com wrote:



 Mark https://plus.google.com/114199149796022210033

 *— Il meglio è l’inimico del bene —*



 On Wed, Sep 12, 2012 at 8:43 PM, Norbert Lindenberg 
 ecmascr...@norbertlindenberg.com wrote:

 ES5 section 15.9.1 specifies a number of operations to map time values
 (measured in milliseconds from January 1, 1970, midnight UTC) to
 year/month/day/hour/minute/second values, and the ECMAScript
 Internationalization API specification section 12.3.2 mandates use of these
 algorithms also for formatting localized date and time strings if the
 Gregorian calendar is used.

 A well-placed test within the SpiderMonkey test suite reminded me of two
 issues related to that:

 - The algorithms use a proleptic Gregorian calendar, that is, apply the
 rules of the Gregorian calendar all the way back to the beginning of
 ECMAScript time. Normal usage, however, is to use the Julian calendar for
 dates before the introduction of the Gregorian calendar in 1582 (and in
 some countries for quite some time after that).



 The problem is that the date when countries shifted to using Gregorian as
 the primary calendar varies wildly. And it wasn't just from the Julian
 calendar; in non-Christian countries, it was from many others (often still
 in use as a secondary calendar). And historically, there were transitions
 between those calendars. And not everyone in the same country followed the
 same calendar, or switched at the same time. And even on the Julian
 calendar, before 525 AD there wasn't the practice of using AD; you'd have
 to have the year set to the year of the emperor (and let's not talk about
 the transitions there...)

 Anyway, it typically isn't worth the trouble. It is quite customary to use
 a proleptic calendar; many if not most standards do it.


 - The year calculation assumes that there was a year 0, while in normal
 usage the year before 1 AD is 1 BC.


 If the implementation supports eras, then you would have 2 AD, 1 AD, 1 BC,
 2 BC...

 If it uses negative proleptic years, then you'd have 2 AD, 1 AD, 0 AD, -1
 AD, -2 AD, ...

 I don't know that we want to have such a difference be a gating item


 With regards to the first issue, the November 2011 draft of the spec had
 limited applicability of the 15.9.1 algorithms to the internationalization
 API to dates after 1930, but then somebody (I forgot who) convinced me to
 remove that limitation. I don't think the second issue has ever been
 discussed, and introducing a year 0 where there was none just seems wrong
 to me.

 Current implementations of Date.prototype.toLocaleString are split:
 Firefox, Chrome, and Opera format a date in 1 BC as being in the year 0,
 while Safari formats it as being in the year 1 (means BC but doesn't say
 so) and Explorer formats it as being in the year 1 BC. Safari calculates
 the day according to the Julian calendar, all others use the proleptic
 Gregorian calendar.

 Thoughts?

 Depending on what we decide, the beginning of ECMAScript time could be
 anywhere between 271816 BC and 271822 BC...

 Regards,
 Norbert

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss



___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: General comment on ES 402 test suite (i18n)

2012-09-11 Thread Mark Davis
Can you reformulate the table attached to
http://unicode.org/cldr/trac/ticket/5302?

In particular, if a currency is not in the LDML table, it gets the default
values (see below). So you need to compare on that basis.

It is much better for comparison if you attach a tab- or comma-delimited
file, so that it can be loaded into a spreadsheet, something like:

Code CLDR ISO
AED 2 2
AFN 0 2
...

We can then review with the currency folk in CLDR the reasons behind any
differences.

http://unicode.org/reports/tr35/#Supplemental_Currency_Data

The fractions element contains any number of info elements, with the
following attributes:

   - *iso4217: *the ISO 4217 code for the currency in question. If a
   particular currency does not occur in the fractions list, then it is given
   the defaults listed for the next two attributes.
   - *digits: *the number of decimal digits normally formatted. The default
   is 2.
   - *rounding: *the rounding increment, in units of 10^-digits. The default
   is 1. Thus with fraction digits of 2 and rounding increment of 5, numeric
   values are rounded to the nearest 0.05 units in formatting. With fraction
   digits of 0 and rounding increment of 50, numeric values are rounded to the
   nearest 50.
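A concrete illustration of the digits defaults, using the ECMA-402
NumberFormat API (results reflect the implementation's CLDR data):

new Intl.NumberFormat('en', { style: 'currency', currency: 'USD' }).format(1234.5);
// "$1,234.50" – 2 fraction digits (the default)
new Intl.NumberFormat('en', { style: 'currency', currency: 'JPY' }).format(1234.5);
// "¥1,235" – JPY is listed with 0 digits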


Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*



On Tue, Sep 11, 2012 at 12:46 PM, Nebojša Ćirić c...@google.com wrote:

 Also, CLDR vs ISO currency bug - http://unicode.org/cldr/trac/ticket/5302


 2012/9/10 Nebojša Ćirić c...@google.com

 Btw. I've filed a bug wrt. de-DD with ICU -
 http://bugs.icu-project.org/trac/ticket/9562


 2012/9/10 Nebojša Ćirić c...@google.com




 2012/9/10 Norbert Lindenberg ecmascr...@norbertlindenberg.com


 On Sep 10, 2012, at 13:24 , Nebojša Ćirić wrote:

  Can you provide bug IDs?
 
 # ICU bug http://bugs.icu-project.org/trac/ticket/9547
'data/test/suite/intl402/ch11/11.3/11.3.2_TRP.js': 'FAIL',
# ICU bug http://bugs.icu-project.org/trac/ticket/9265
'data/test/suite/intl402/ch09/9.2/9.2.5_11_g_ii_2.js': 'FAIL'
 
   I don't have actual bug ID for ISO - CLDR issue (the fraction digits
 for currencies). I'll talk to Mark about it.

 Thanks!

   8/25 EF are from not implementing the i18n support for
 localeCompare and similar functions (yet).
 
  Looking forward to more info on this once you get there.
 
  As soon as we ratify the spec :).

 Would be good to try this before we ratify. Just don't ship it yet :-)

  NativeJSFormatter is a V8 C++ method and it can detect if it was called
  as a constructor or not. But by the time I call it, it's already too late.
  It's interesting that a requirement like this is in the ES spec, but they
  don't provide a way to check/enforce it.

 Have you talked to the V8 team about this and the prototype issue?


  I filed a bug about the prototype issue -
 http://code.google.com/p/v8/issues/detail?id=2293.
 As for the new/constructor issue they pointed out the internal C++
 method I can't use (as mentioned). I am not sure they can do much there
 without actual ES spec telling them what/how to do it.


2/6 F are from 1x.3_a.js tests, where 0 property of
 Array.prototype is tainted. I don't know how to guard against this. Any
 pointers?
 
  You mean 9.2.1_2.js and 9.2.6_2.js? The spec here refers to the List
 specification type, and I implemented List objects using Array methods that
 you have to grab before anybody can replace them.
 
  Methods are fine, but what do you do with '0' property. You can't
 grab all indices in range to protect override.

 List.prototype = Object.create(null);

  I have to check why:
 
  Object.defineProperty(Intl.Collator, 'prototype', new Intl.Collator())

 I changed the spec a while ago to not use an actual Collator object as
 the prototype object, after Allen and Suzuki-san reported problems with
 this approach. Use Intl.Collator.call({}) with the standard built-in values
 of Intl.Collator and Function.prototype.call instead.


 I'll try that, thanks.

 --
 Nebojša Ćirić




 --
 Nebojša Ćirić




 --
 Nebojša Ćirić

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: ECMAScript collation question

2012-09-05 Thread Mark Davis
That works (for now).

Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*



On Wed, Sep 5, 2012 at 11:04 AM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 It was too weak indeed - I added the requirement that normalization is
 turned on by default.

 Norbert


 On Sep 4, 2012, at 13:23 , Mark Davis ☕ wrote:

  In view of the schedule, I suggest that we make your first, minimal
 change right now, and plan to correct it along one of the other lines in
 the next edition.
 
  #1 is much weaker than we want, so we should correct it, but we can do
 that in edition 2.
 
  Mark
 
  — Il meglio è l’inimico del bene —
 
 
 
  On Tue, Sep 4, 2012 at 12:35 PM, Norbert Lindenberg 
 ecmascr...@norbertlindenberg.com wrote:
  Seeing that the final draft of the spec is due today, here's a breakdown
 of possible changes around normalization in Collator:
 
  1) Change the description of Intl.Collator.prototype.compare to say:
 The method is required to return 0 when comparing Strings that are
 considered canonically equivalent by the Unicode standard, unless collator
 has a [[normalization]] internal property whose value is false.
 
  This is the smallest possible change to the spec that's needed to make
 its canonical equivalence and normalization requirements consistent, and
 I've made it.
 
  2) Require support for the normalization property and the kk key.
 
  The way I phrased the spec in 1), this isn't necessary anymore, and we
 can make this change in the second edition if needed.
 
  3) Add locale to the set of acceptable input values for the
 normalization property of options. Implementations that support the
 normalization property would use the selected locale's default for the kk
 key. The normalization property of the object returned by resolvedOptions
 remains a boolean.
 
  This change could be made today or in the second edition. If we make it
 in the second edition, implementations of the first edition would interpret
 locale as true because locale is truthy. The conformance clause does
 not allow implementations to add support for this value on their own.
 
  4) Add locale to the set of acceptable values of the kk key of BCP 47.
 The Internationalization API would use this, if the normalization property
 of options is undefined, to map to the appropriate boolean value.
 
  This can't happen today, and I'm not sure it's really required. Turning
 off normalization is primarily an optimization and so should be under
 application control.
 
  Comments?
 
  Norbert
 
 
  On Sep 1, 2012, at 16:19 , Mark Davis ☕ wrote:
 
Support for the normalization property in options and the kk key
 would become mandatory.
  
   The options that ICU offers are to observe full canonical equivalence:
 • For all locales
 • kk=true
 • For key locales (where it is necessary); otherwise partial
 (FCD)
 • kk=not present
 • For no locales; always partial (FCD)
 • kk=false
   Your proposal looks reasonable, except I'm not sure how someone would
 use the kk value to get #2.
  
   Mark
  
   — Il meglio è l’inimico del bene —
  
  
  
   On Fri, Aug 31, 2012 at 3:30 PM, Norbert Lindenberg 
 ecmascr...@norbertlindenberg.com wrote:
   I think #2 is far more common for ECMAScript - typical use would be to
 re-sort a list of a few dozen or at most a few hundred entries and then
 redisplay that list. #1 might become more common though as JavaScript use
 on the server progresses.
  
   So here's an alternative spec approach:
  
   - Leave the specification of String.prototype.localeCompare as is.
 That is, if it's not based on Collator, canonical equivalence - 0 is
 required.
  
   - For Collator.prototype.compare, require that canonical equivalence
 - 0 unless the client explicitly turns off normalization (i.e.,
 normalization is on by default, independent of locale). Support for the
 normalization property in options and the kk key would become mandatory.
  
   Norbert
  
  
   On Aug 31, 2012, at 10:12 , Mark Davis ☕ wrote:
  
I think we could go either way. It depends on the usage mode.
  • The case where performance is crucial is where you are
 comparing gazillions of strings, such as records in a database.
  • If the number of strings to be compared is relatively small,
 and/or there is enough overhead anyway, the performance win by turning off
 full normalization would be lost in the noise.
So if #2 is the expected use case, we could require full
 normalization.
   
   
Mark
   
— Il meglio è l’inimico del bene —
   
   
   
On Fri, Aug 31, 2012 at 9:56 AM, Norbert Lindenberg 
 ecmascr...@norbertlindenberg.com wrote:
The question for ECMAScript then is whether we should stick with
 must do (the current state of the specifications) or change to must be
 able to do.
   
The changes for must be able to do would be:
   
- In the Language specification, remove

Re: ECMAScript collation question

2012-09-04 Thread Mark Davis
In view of the schedule, I suggest that we make your first, minimal change
right now, and plan to correct it along one of the other lines in the next
edition.

#1 is much weaker than we want, so we should correct it, but we can do that
in edition 2.
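For illustration, the behavior required by that minimal change, expressed
with the API (assuming a Collator whose [[normalization]] is true, the
default):

const collator = new Intl.Collator('en');
collator.compare('\u00e9', 'e\u0301');  // 0 – precomposed "é" vs. "e" + combining acute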

Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*



On Tue, Sep 4, 2012 at 12:35 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 Seeing that the final draft of the spec is due today, here's a breakdown
 of possible changes around normalization in Collator:

 1) Change the description of Intl.Collator.prototype.compare to say: The
 method is required to return 0 when comparing Strings that are considered
 canonically equivalent by the Unicode standard, unless collator has a
 [[normalization]] internal property whose value is false.

 This is the smallest possible change to the spec that's needed to make its
 canonical equivalence and normalization requirements consistent, and I've
 made it.

 2) Require support for the normalization property and the kk key.

 The way I phrased the spec in 1), this isn't necessary anymore, and we can
 make this change in the second edition if needed.

 3) Add "locale" to the set of acceptable input values for the
 normalization property of options. Implementations that support the
 normalization property would use the selected locale's default for the kk
 key. The normalization property of the object returned by resolvedOptions
 remains a boolean.

 This change could be made today or in the second edition. If we make it in
 the second edition, implementations of the first edition would interpret
 "locale" as true because "locale" is truthy. The conformance clause does
 not allow implementations to add support for this value on their own.

 4) Add "locale" to the set of acceptable values of the kk key of BCP 47.
 The Internationalization API would use this, if the normalization property
 of options is undefined, to map to the appropriate boolean value.

 This can't happen today, and I'm not sure it's really required. Turning
 off normalization is primarily an optimization and so should be under
 application control.
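
 For illustration, a minimal sketch of the behavior required by (1) and (2),
 assuming an implementation that supports the draft's normalization option:

 var composed   = "\u00C5";     // "Å" as a single code point
 var decomposed = "A\u030A";    // "A" followed by U+030A COMBINING RING ABOVE
 new Intl.Collator("en").compare(composed, decomposed);        // 0 (required)
 new Intl.Collator("en", { normalization: false })
     .compare(composed, decomposed);                           // may be non-zero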

 Comments?

 Norbert


 On Sep 1, 2012, at 16:19 , Mark Davis ☕ wrote:

   Support for the normalization property in options and the kk key would
 become mandatory.
 
  The options that ICU offers are to observe full canonical equivalence:
• For all locales
• kk=true
• For key locales (where it is necessary); otherwise partial (FCD)
• kk=not present
• For no locales; always partial (FCD)
• kk=false
  Your proposal looks reasonable, except I'm not sure how someone would
 use the kk value to get #2.
 
  Mark
 
  — Il meglio è l’inimico del bene —
 
 
 
  On Fri, Aug 31, 2012 at 3:30 PM, Norbert Lindenberg 
 ecmascr...@norbertlindenberg.com wrote:
  I think #2 is far more common for ECMAScript - typical use would be to
 re-sort a list of a few dozen or at most a few hundred entries and then
 redisplay that list. #1 might become more common though as JavaScript use
 on the server progresses.
 
  So here's an alternative spec approach:
 
  - Leave the specification of String.prototype.localeCompare as is. That
 is, if it's not based on Collator, canonical equivalence -> 0 is required.
 
  - For Collator.prototype.compare, require that canonical equivalence ->
 0 unless the client explicitly turns off normalization (i.e., normalization
 is on by default, independent of locale). Support for the normalization
 property in options and the kk key would become mandatory.
 
  Norbert
 
 
  On Aug 31, 2012, at 10:12 , Mark Davis ☕ wrote:
 
   I think we could go either way. It depends on the usage mode.
 • The case where performance is crucial is where you are
 comparing gazillions of strings, such as records in a database.
 • If the number of strings to be compared is relatively small,
 and/or there is enough overhead anyway, the performance win by turning off
 full normalization would be lost in the noise.
   So if #2 is the expected use case, we could require full normalization.
  
  
   Mark
  
   — Il meglio è l’inimico del bene —
  
  
  
   On Fri, Aug 31, 2012 at 9:56 AM, Norbert Lindenberg 
 ecmascr...@norbertlindenberg.com wrote:
   The question for ECMAScript then is whether we should stick with must
 do (the current state of the specifications) or change to must be able to
 do.
  
   The changes for must be able to do would be:
  
   - In the Language specification, remove the description of
 String.prototype.localeCompare, and require implementations to follow the
 Internationalization API specification at least for this method, or better
 provide the complete Internationalization API. That way, localeCompare
 acquires support for the normalization property in options, and the -kk-
 key in the Unicode locale extensions

Re: ECMAScript collation question

2012-09-02 Thread Mark Davis
We could propose to the CLDR group adding attribute=default to mean (for
CLDR) the same as missing (at least for kk, if not others).

That would formally work, but would mean than in an ECMAScript context
missing != default, while in other CLDR contexts, missing == default.

May work, but any other thoughts?

Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Sun, Sep 2, 2012 at 8:15 AM, Markus Scherer markus@gmail.com wrote:

 On Sat, Sep 1, 2012 at 4:19 PM, Mark Davis ☕ m...@macchiato.com wrote:

 Your proposal looks reasonable, except I'm not sure how someone would use
 the kk value to get #2.


 Could we say kk=default?
 markus

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: ECMAScript collation question

2012-09-01 Thread Mark Davis
 Support for the normalization property in options and the kk key would
become mandatory.

The options that ICU offers are to observe full canonical equivalence:

   1. For all locales
  - kk=true
   2. For key locales (where it is necessary); otherwise partial (FCD)
  - kk=not present
   3. For no locales; always partial (FCD)
  - kk=false

Your proposal looks reasonable, except I'm not sure how someone would use
the kk value to get #2.
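
As a sketch of how the two spellings would line up under the draft (assuming an
implementation that supports both the normalization property and the kk key):

var viaOptions = new Intl.Collator("de", { normalization: false });
var viaTag     = new Intl.Collator("de-u-kk-false");
// Both requests should resolve the same way, and resolvedOptions().normalization
// would report false in either case.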

Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Fri, Aug 31, 2012 at 3:30 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 I think #2 is far more common for ECMAScript - typical use would be to
 re-sort a list of a few dozen or at most a few hundred entries and then
 redisplay that list. #1 might become more common though as JavaScript use
 on the server progresses.

 So here's an alternative spec approach:

 - Leave the specification of String.prototype.localeCompare as is. That
 is, if it's not based on Collator, canonical equivalence -> 0 is required.

 - For Collator.prototype.compare, require that canonical equivalence -> 0
 unless the client explicitly turns off normalization (i.e., normalization
 is on by default, independent of locale). Support for the normalization
 property in options and the kk key would become mandatory.

 Norbert


 On Aug 31, 2012, at 10:12 , Mark Davis ☕ wrote:

  I think we could go either way. It depends on the usage mode.
• The case where performance is crucial is where you are comparing
 gazillions of strings, such as records in a database.
• If the number of strings to be compared is relatively small,
 and/or there is enough overhead anyway, the performance win by turning off
 full normalization would be lost in the noise.
  So if #2 is the expected use case, we could require full normalization.
 
 
  Mark
 
  — Il meglio è l’inimico del bene —
 
 
 
  On Fri, Aug 31, 2012 at 9:56 AM, Norbert Lindenberg 
 ecmascr...@norbertlindenberg.com wrote:
  The question for ECMAScript then is whether we should stick with must
 do (the current state of the specifications) or change to must be able to
 do.
 
  The changes for must be able to do would be:
 
  - In the Language specification, remove the description of
 String.prototype.localeCompare, and require implementations to follow the
 Internationalization API specification at least for this method, or better
 provide the complete Internationalization API. That way, localeCompare
 acquires support for the normalization property in options, and the -kk-
 key in the Unicode locale extensions.
 


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: ECMAScript collation question

2012-08-31 Thread Mark Davis
I think we could go either way. It depends on the usage mode.

   1. The case where performance is crucial is where you are comparing
   gazillions of strings, such as records in a database.
   2. If the number of strings to be compared is relatively small, and/or
   there is enough overhead anyway, the performance win by turning off full
   normalization would be lost in the noise.

So if #2 is the expected use case, we could require full normalization.


Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Fri, Aug 31, 2012 at 9:56 AM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 The question for ECMAScript then is whether we should stick with must do
 (the current state of the specifications) or change to must be able to do.

 The changes for must be able to do would be:

 - In the Language specification, remove the description of
 String.prototype.localeCompare, and require implementations to follow the
 Internationalization API specification at least for this method, or better
 provide the complete Internationalization API. That way, localeCompare
 acquires support for the normalization property in options, and the -kk-
 key in the Unicode locale extensions.

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: ECMAScript collation question

2012-08-30 Thread Mark Davis
ICU *is* always able to compare them as being equal, just by setting the
parameter.

Even if the parameter isn't set, it uses an FCD sort (see
http://unicode.org/notes/tn5/) and canonical closure, which handles most
cases of canonical equivalence. The default is turned on for languages
where the normal+auxiliary exemplar sets contains characters that would
show a difference even with an FCD+closure sort, and can be turned on
always if desired (at some cost in performance; 30% sounds high though).

Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Thu, Aug 30, 2012 at 6:30 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 In particular, a conformant implementation must be able to compare any two
 canonical-equivalent strings as being equal, for all Unicode characters
 supported by that implementation.
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Unicode support in new ES6 spec draft

2012-07-17 Thread Mark Davis
A string reversal is not exactly a high-runner API, and the simple
codepoint reversal will have pretty bad results where grapheme-cluster ≠
single code point.
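
A small illustration of the hazard (the sample string here is just an example):

var s = "e\u0301\uD83D\uDE00";               // "e" + combining acute, then U+1F600
var naive = s.split("").reverse().join("");  // reverses 16-bit code units
// naive now starts with an unpaired trail surrogate, and the combining accent
// ends up attached to a different base character than before.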

--
Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Tue, Jul 17, 2012 at 3:03 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 We agreed in November not to add String.prototype.reverse because there
 was no compelling use case for it. Is there now?
 https://mail.mozilla.org/pipermail/es-discuss/2011-November/018581.html

 Norbert


 On Jul 17, 2012, at 14:49 , Brendan Eich wrote:

  Allen Wirfs-Brock wrote:
  On Jul 16, 2012, at 2:57 PM, Mark Davis ☕ wrote:
 
  In order to support backwards iteration (which is sometimes used), we
 should have codePointBefore.
 
  or we can provide a backwards iterator that knows how to parse
 surrogate pairs:
 for (let c of str.backwards) ...
 
  Allen
 
  Kind of a spin-off, but I think a String.prototype.reverse that avoids
 
   s.split('').reverse().join('')
 
  overhead and ES6 Unicode hazard splitting on code unit boundary would be
 swell. It's tiny and matches Array.prototype.reverse but of course without
 observable in-place mutation.
 
  It wouldn't relieve all use-cases for reverse iteration, but we have
 iterators and for-of in ES6, we should use 'em.
 
  /be
 
 
 
 
 
  Mark https://plus.google.com/114199149796022210033
  /
  /
  /— Il meglio è l’inimico del bene —/
  //
 
 
  ___
  es-discuss mailing list
  es-discuss@mozilla.org
  https://mail.mozilla.org/listinfo/es-discuss
  ___
  es-discuss mailing list
  es-discuss@mozilla.org
  https://mail.mozilla.org/listinfo/es-discuss

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Unicode support in new ES6 spec draft

2012-07-16 Thread Mark Davis
In order to support backwards iteration (which is sometimes used), we
should have codePointBefore.
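
A possible shape for it, sketched as a plain function (codePointBefore is not in
the current draft; the name and behavior here are assumptions for illustration):

function codePointBefore(s, pos) {
  var trail = s.charCodeAt(pos - 1);
  if (trail >= 0xDC00 && trail <= 0xDFFF && pos >= 2) {
    var lead = s.charCodeAt(pos - 2);
    if (lead >= 0xD800 && lead <= 0xDBFF) {
      return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
    }
  }
  return trail;
}
// Backwards iteration then steps by 2 code units whenever the returned
// code point is above 0xFFFF, and by 1 otherwise.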

--
Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Mon, Jul 16, 2012 at 2:54 PM, Gillam, Richard gil...@lab126.com wrote:

 Why is it intentional?  I don't see the value in restricting it.  You've
 mentioned you're optimizing for the forward-iteration case and want to have
 a separate API for the backward-iteration case.  What about the
 random-access case?  Is there no such case?  Worse, it seems like if you
 use this API for backward iteration or random access, you don't get an
 error; you just get *the wrong answer*, and that seems dangerous.  [I guess
 the wrong answer is an unpaired surrogate value, which would tip the
 caller off that he's doing something wrong, but that still seems like extra
 code he'd need to write.]

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Quasi-literals and localization

2012-07-12 Thread Mark Davis
I didn't pay enough attention to the whole quasi structure, so I can't
pretend to speak intelligently about that.

We do support a number of different mechanisms for string translation, many
of them extracting strings from source files (including templating source
languages like jsps, soy (aka closure templates)). Message formats are not
trivial, especially when it comes to plural and gender support. In some of
these environments, features like plurals and gender are built into the
syntax of the language (like soy, see
http://closure-templates.googlecode.com/svn/trunk/javadoc-complete/com/google/template/soy/msgs/restricted/IcuSyntaxUtils.html
although the formatting is very ugly).
In others, the programmer has to call a function to interpret a string
representation of the message format (which could be fetched from an
external source).

That can all work fine, with some provisos; there have to be
straightforward programmatic ways to:

   1. determine which are the messages in the file that need to be
   translated (with some mechanism to skip those that shouldn't be translated,
   like a literal 'http://'.)
   2. determine the structure of all of the embedded message format strings
   and map into a 'lingua franca' structure for translation (we use an XML
   structure).
   3. carry message descriptions along: these are descriptions of the
   entire message, plus meaningful names and descriptions and examples of each
   placeholder value.

I don't know whether the quasi structure makes that easier or harder. I
also haven't seen any examples where someone has taken a reasonably complex
file with quasi messages (containing plurals, gender, dates, times,
numbers, multiple placeholders with different orders, etc), extracted the
messages for translation (including all of #3), translated, and regenerated
the alternate language version of the original file (or a modified version
that uses methods to get the translated strings).

I'm not saying it can't be done, or that it hasn't been done (it could well
have been); just that I personally haven't seen it.

--
Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Thu, Jul 12, 2012 at 12:20 AM, Mike Samuel mikesam...@gmail.com wrote:

 2012/7/2 Norbert Lindenberg ecmascr...@norbertlindenberg.com:
  The quasi-literal proposal discusses at some length text localization
 [1]. After reading the section and then realizing that most of the
 discussed functionality is not actually part of the proposal, I'd like to
 ask that the discussion of localization in this context be removed.

 The goal of the quasi-literal proposal is not to define any APIs for
 localization but only to show that it could benefit a range of
 localization proposals.

  1) The discussion of the msg function is very incomplete with regards to
 the current state of the art in internationalized message construction.
 Modern message formatting libraries include support for plural and gender
 handling, which is not discussed here [2], and offer far more comprehensive
 number and date formatting than discussed here. The msg function also
 doesn't integrate with the ECMAScript Internationalization API [3].

 Yep.  At the time the Quasi proposal was written, that was very much
 in flux.  I chatted with Nebosja Ciric, Mark Davis and others last
 March and they were planning on contributing to another proposal so I
 just focused on explaining how a message extraction - localization -
 message reintegration pipeline could work with messages in quasis, and
 showing how the various concerns like l10n and security could compose.


  2) Quasi-literals are based on the assumption that the pattern strings
 are normally embedded in the source code. In internationalization, that's
 called hard-coded strings and generally considered a really bad idea.
 Normally, you want to separate localizable text and data from the source
 code, so that localization can proceed and languages can be added without
 changes to the code. I'm aware that some companies are using a localization
 process that involves generating localized source files with embedded
 strings. This may be viable for web application where the code is hosted on
 a server and sent to the browser for execution each time the application is
 started. I don't see how it would work for applications that actually run
 on the server (e.g., within Node.js) and where the server has to provide
 localized responses in different languages for each request. I don't think
 it's viable either for applications that are made available through an
 application store (e.g., those built with PhoneGap or Titanium) and which have to include support for
 multiple applications with minimal download size.

 What about quasis makes dynamically choosing a message bundle based on
 the locale of the current request and substituting for the source
 language message particularly difficult?


  3) The workaround

Re: Internationalization: Additional values in API

2012-06-26 Thread Mark Davis
I tend to agree with your proposal.

Some caveats below.


--
Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Tue, Jun 26, 2012 at 3:22 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 The TC 39 meeting on 2012-05-21 decided to allow implementations to
 recognize property values for which the specification prescribes an Error
 [1]:

  - 2. Conformance
 - What about already defined properties? Can we add new,
 implementation specific values, like v8Identical for collator sensitivity?
 - We should throw if we don't recognise the value. You may recognise
 additional property values.

 I'd like to propose a more restricted escape hatch, to be added to the
 existing allowances for additional objects, properties, and functions in
 the Conformance clause:

 spec
 In the following cases where the specification requires that a RangeError
 is thrown for unacceptable input values, implementations may define
 additional acceptable input values for which the RangeError is not thrown:
 - The options property localeMatcher in all constructors and
 supportedLocalesOf methods.
 - The options properties usage and sensitivity in the Collator constructor.
 - The options properties style, currencyDisplay, minimumIntegerDigits,
 minimumFractionDigits, maximumFractionDigits, minimumSignificantDigits, and
 maximumSignificantDigits in the NumberFormat constructor.


For the ones that are integers, it would seem odd to accept other values.


 - The options property timeZone in the DateTimeFormat constructor,
 provided that the additional acceptable input values are case-insensitive
 matches of Zone or Link identifiers in the IANA time zone database [2] and
 are canonicalized to Zone identifiers in the casing used in the database
 for DateTimeFormat.resolvedOptions().timeZone, except that Etc/GMT shall
 be canonicalized to UTC.


I agree with your reasoning below, but I would rather use the CLDR
values in http://unicode.org/repos/cldr/trunk/common/bcp47/timezone.xml,
since they are based on the TZDB but more stable. Either just names or
names + aliases.


 - The options properties listed in table 3 in the DateTimeFormat
 constructor.
 - The options property formatMatcher in the DateTimeFormat constructor.
 /spec


 The above prevents additional values in the following cases:

 - Input values that lead to TypeError exceptions. These are usually not
 meaningful extension points.

 - Input values that are boolean. There just aren't additional meaningful
 boolean values.

 - Language tags that are not structurally valid. Structural validity is a
 quite minimal requirement, and BCP 47 itself is very extensible. Allowing
 additional values in the Internationalization API would only create
 confusion.

 - Currency codes that are not well-formed. Here as well, well-formedness
 is a quite minimal requirement, and ISO 4217 itself allows registration of
 any actual new currency codes. Allowing additional values in the
 Internationalization API would only create confusion.

 - Additional keys and values from Unicode Technical Standard 35, Unicode
 Locale Data Markup Language [3]. UTS 35 defines several keys and values
 that we have agreed are not useful for the Internationalization API, so we
 should be able to screen new ones before they're added.


I'm a bit hesitant about the screening, since it may take a while (looking
at history) between updates, unless there is a lighter-weight mechanism.



 - NaN and +/- Infinity in DateTimeFormat.prototype.format. These just
 aren't meaningful time values.


 The most unusual part of the proposed addition to the Conformance clause
 is the mini-specification for additional time zone identifiers. In the
 discussions on DateTimeFormat, we deferred defining support for a larger
 set of time zones because not all implementations are ready to support
 them. If we allow implementations to accept additional values, however,
 it's a pretty safe guess that several implementations will extend the set
 of supported time zones quickly, because applications need it, and it's
 also a pretty safe guess that they'll build their support around the IANA
 time zone IDs [2], and that the values would not be prefixed. There may
 however be inconsistencies around case significance and around
 canonicalization of the names in DateTimeFormat.prototype.resolvedOptions.
 In this situation, I think it would be better to standardize now which
 values may get accepted optionally and how they're processed.
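
 Under that allowance, an implementation that chooses to support IANA IDs might
 behave as follows (a sketch, not required behavior):

 var dtf = new Intl.DateTimeFormat("en-US", { timeZone: "america/los_angeles" });
 dtf.resolvedOptions().timeZone;  // "America/Los_Angeles" (canonical IANA casing)
 // and { timeZone: "Etc/GMT" } would canonicalize to "UTC".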

 Comments?

 Regards,
 Norbert


 [1] https://mail.mozilla.org/pipermail/es-discuss/2012-May/022836.html
 [2] http://www.iana.org/time-zones/
 [3] http://unicode.org/reports/tr35/

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org

Re: Unicode normalization

2012-05-29 Thread Mark Davis
This is for v2, right?

--
Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Tue, May 29, 2012 at 5:34 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 The ECMAScript Language Specification 5.1 makes assumptions about source
 text being in Unicode normalization form C (NFC), but doesn't say anything
 that would actually make it so. Implementations, as far as I can tell, have
 also chosen to just assume. This is partially based on the Character
 Model for the World Wide Web: Normalization, which recommends early
 normalization to NFC, but never became a standard.

 I'm proposing to correct this by
 - removing the invalid assumptions from the specification,
 - add a normalization function so that applications can normalize text
 where needed.

 http://wiki.ecmascript.org/doku.php?id=strawman:unicode_normalization

 Comments?

 Regards,
 Norbert

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Internationalization API issues and updates

2012-04-16 Thread Mark Davis
Lgtm
On Mar 26, 2012 4:59 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 While everybody is reviewing the draft specification of the ECMAScript
 Internationalization API [1] in preparation for this week's TC 39 meeting,
 here are a few issues that have come up, with proposed resolutions:


 Issue 1, IsWellFormedLanguageTag (6.2.2), raised by Allen:

 The specification referenced here, RFC 5646 section 2.1, says nothing
 about the case of duplicate extension subtags (example: the duplicate -u-
 in de-u-nu-latn-u-ca-gregory), although RFC 5646 section 2.2.9 and
section 3.7 say that duplicate extension subtags are invalid. The
 ResolveLocale abstract operation (Globalization API, 9.2.1) will only
 consider the first extension subtag sequence it sees, and ignore others,
 without giving applications any hint as to what's going on.

 Should IsWellFormedLanguageTag be enhanced to check for duplicate
 extension subtags? And then probably also duplicate variant subtags?

 The one thing I'm sure we don't want is validation against the IANA
 Language Subtag Registry.

 My proposed resolution: Add checking for duplicate extension subtags and
 duplicate variant subtags, and throw exception if they exist.


 Issue 2, CanonicalizeLanguageTag (6.2.3), raised by Allen:

 The spec used to say (before February 23): Implementations are allowed,
 but not required, to also canonicalize each extension subtag sequence
 within the tag according to the canonicalization specified by the standard
 registering the extension, such as RFC 6067 section 2.1.1.

 Allen points out that the result is visible to ECMAScript code, and that
 this is the sort of situation were TC39 prefers to mandate a consistent
 result across all implementations.

 Counterarguments to requiring extension subtag sequence canonicalization:
 1. New extensions are being defined that implementations may not know
 about (and have no need to know about).
 2. For the extension that this API cares about, the -u- extension, a
 comparison of language tags as complete strings isn't very useful because
 different functionality cares about different extension keys - Collator
 about -co- and a few others, NumberFormat about -nu-, and DateTimeFormat
 about -ca-. ResolveLocale picks out the extension keys that are relevant
 for its caller.

 Note that canonicalization according to BCP 47 is mandatory; only the
 additional rules created by extension specifications are currently optional.

 My proposed resolution: Clarify that the quoted statement is only about
 canonicalization rules that go beyond those of BCP 47; don't change the
 behavior. The new wording in the February 23 draft is:
 The specifications for extensions to BCP 47 language tags, such as RFC
 6067, may include canonicalization rules for the extension subtag sequences
 they define that go beyond the canonicalization rules of RFC 5646 section
 4.5. Implementations are allowed, but not required, to apply these
 additional rules.


 Issue 3, InitializeDateTimeFormat and ToDateTimeOptions (13.1.1), raised
 by Nebojša:

 The spec doesn't allow the creation of a formatter with time elements
 only. InitializeDateTimeFormat calls ToDateTimeOptions with arguments true
 for date and false for time. Specifying time elements (hour, minute,
 second) won't trigger needDefault = false since time = false (step 6.), so
 we go to the section that defines date properties (step 7). You end up
 having date elements you didn't ask for in addition to any time elements
 you did ask for.
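
 For example (a sketch of the problem as currently drafted):

 var f = new Intl.DateTimeFormat("en", { hour: "numeric", minute: "numeric" });
 // Because ToDateTimeOptions is called with date = true and time = false, the
 // hour/minute request does not suppress the date defaults, so year, month,
 // and day get filled in and show up in f.format(new Date()).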

 My proposed resolution: Replace the date and time parameters of
 ToDateTimeOptions with:
 - required: which component groups are required, values date, time,
 any
 - defaults: which component groups should be filled in if required
 components aren't there, values date, time, all.
 Update the ToDateTimeOptions algorithm as well as the calls to it from
 InitializeDateTimeFormat and Date.prototype.toLocale(|Date|Time)String
 accordingly.


 Issue 4, InitializeDateTimeFormat (13.1.1), raised by me:

 The algorithm looks for a property formatMatch in the options argument.
 The name agreed on in the November 15 meeting of the internationalization
 team was formatMatcher, and there is a parallel property localeMatcher.

 My proposed resolution: Rename the property to formatMatcher. Rename the
 *Match abstract operations in parallel.


 Regards,
 Norbert

 [1]
 http://wiki.ecmascript.org/doku.php?id=globalization:specification_drafts

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Full Unicode based on UTF-16 proposal

2012-03-27 Thread Mark Davis
The point of C1 is that you can't interpret the surrogate code point U+DC00
as a *character*, like an a.

Neither can you interpret the reserved code point U+0378 as a *character*,
like a b.

--
Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Tue, Mar 27, 2012 at 08:56, Glenn Adams gl...@skynav.com wrote:

 This begs the question of what is the point of C1.


 On Tue, Mar 27, 2012 at 9:13 AM, Mark Davis ☕ m...@macchiato.com wrote:

 That would not be practical, nor predictable. And note that the 700K
 reserved code points are also not to be interpreted as characters; by your
 logic all of them would need to be converted to FFFD.

 And in practice, an unpaired surrogate is best treated just like a
 reserved (unassigned) code point. For example, a lowercase operation should
 convert characters with lowercase correspondants to those correspondants,
 and leave *everything* else alone: control characters, format characters,
 reserved code points, surrogates, etc.

 --
 Mark https://plus.google.com/114199149796022210033
 *
 *
 *— Il meglio è l’inimico del bene —*
 **



 On Tue, Mar 27, 2012 at 08:02, Glenn Adams gl...@skynav.com wrote:



 On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ m...@macchiato.comwrote:

 That, as Norbert explained, is not the intention of the standard. Take
 a look at the discussion of Unicode 16-bit string in chapter 3. The
 committee recognized that fragments may be formed when working with UTF-16,
 and that destructive changes may do more harm than good.

 x = a.substring(0, 5) + b + a.substring(5, a.length());
 y = x.substring(0, 5) + x.substring(6, x.length());

 After this operation is done, you want y == a, even if 5 is between
 D800 and DC00.


 Assuming that b.length() == 1 in this example, my interpretation of this
 is that '=', '+', and 'substring' are operations whose domain and co-domain
 are (currently defined) ES Strings, namely sequences of UTF-16 code units.
 Since none of these operations entail interpreting the semantics of a code
 point (i.e., interpreting abstract characters), then there is no violation
 of C1 here.

 Or take:
  output = "";
  for (int i = 0; i < s.length(); ++i) {
    ch = s.charAt(i);
    if (ch.equals('&')) {
      ch = '@';
    }
    output += ch;
  }

  After this operation is done, you want "a&\u{10000}b" to become "a@\u{10000}b",
  not "a\u{FFFD}\u{FFFD}b".
 It is also an unnecessary burden on lower-level software to always
 check this stuff.


  Again, in this example, I assume that the string literal "a&\u{10000}b"
 maps to the UTF-16 code unit sequence:

 0061 0026 D800 DC00 0062

 Given that 'charAt(i)' is defined on (and is indexing) code units and
 not code points, and since the 'equals' operator is also defined on code
 units, this example also does not require interpreting the semantics of
 code points (i.e., interpreting abstract characters).

 However, in Norbert's questions above about isUUppercase(int) and
 toUpperCase(int), it is clear that the domain of these operations are code
 points, not code units, and further, that they requiring interpretation as
 abstract characters in order to determine the semantics of the
 corresponding characters.

 My conclusion is that the determination of whether C1 is violated or not
 depends upon the domain, codomain, and operation being considered.


 Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage or
 output, then you do need to either convert to FFFD or take some other
 action.

 --
 Mark https://plus.google.com/114199149796022210033
 *
 *
 *— Il meglio è l’inimico del bene —*
 **



 On Mon, Mar 26, 2012 at 23:11, Glenn Adams gl...@skynav.com wrote:


 On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg 
 ecmascr...@norbertlindenberg.com wrote:

 The conformance clause doesn't say anything about the interpretation
 of (UTF-16) code units as code points. To check conformance with C1, you
 have to look at how the resulting code points are actually further
 interpreted.


 True, but if the proposed language

 A code unit that is in the range 0xD800 to 0xDFFF, but is not part of
 a surrogate pair, is interpreted as a code point with the same value.

 is adopted, then will not this have an effect of creating unpaired
 surrogates as code points? If so, then by my estimation, this *will* 
 increase
 the likelihood of their being interpreted as abstract characters... e.g.,
 if the unpaired code unit is interpreted as a unpaired surrogate code
 point, and some process/function performs *any* predicate or
 transform on that code point, then that amounts to interpreting it as an
 abstract character.

 I would rather see such unpaired code unit either (1) be mapped to
 U+00FFFD, or (2) an exception raised when performing an operation that
 requires conversion of the UTF-16 code unit sequence.


 My proposal interprets the resulting code points in the following
 ways:

 1

Re: Full Unicode based on UTF-16 proposal

2012-03-16 Thread Mark Davis
Whew, a lot of work, Norbert. Looks quite good. My one question is whether
it is worth having a mechanism for iteration.

OLD CODE
for (int i = 0; i < s.length(); ++i) {
  var x = s.charAt(i);
  // do something with x
}

Using your mechanism, one would write:

NEW CODE
for (int i = 0; i < s.length(); ++i) {
  var x = s.codePointAt(i);
  // do something with x
  if (x > 0xFFFF) {
++i;
  }
}

In Java, for example, I *really* wish you could write:

DESIRED

for (int codepoint : s) {
  // do something with x
}

However, maybe this kind of iteration is rare enough in ES that it suffices
to document the pattern under NEW CODE.

Thanks for all your work!


 proposal for upgrading ECMAScript to a Unicode version released in this
century

This was amusing; could have said this millennium ;-)
--
Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Fri, Mar 16, 2012 at 01:55, Erik Corry erik.co...@gmail.com wrote:

 This is very useful, and was surely a lot of work.  I like the general
 thrust of it a lot.  It has a high level of backwards compatibility,
 does not rely on the VM having two different string implementations in
 it, and it seems to fix the issues people are encountering.

 However I think we probably do want the /u modifier on regexps to
 control the new backward-incompatible behaviour.  There may be some
 way to relax this for regexp literals in opted in Harmony code, but
 for new RegExp(...) and for other string literals I think there are
 rather too many inconsistencies with the old behaviour.

 The algorithm given for codePointAt never returns NaN.  It should
 probably do that for indices that hit a trail surrogate that has a
 lead surrogate preceding it.

 Perhaps it is outside the scope of this proposal, but it would also
 make a lot of sense to add some named character classes to RegExp.

 If we are makig a /u modifier for RegExp it would also be nice to get
 rid of the incorrect case independent matching rules.  This is the
 section that says: If ch's code unit value is greater than or equal
 to decimal 128 and cu's code unit value is less than decimal  128,
 then return ch.

 2012/3/16 Norbert Lindenberg ecmascr...@norbertlindenberg.com:
  Based on my prioritization of goals for support for full Unicode in
 ECMAScript [1], I've put together a proposal for supporting the full
 Unicode character set based on the existing representation of text in
 ECMAScript using UTF-16 code unit sequences:
 
 http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html
 
  The detailed proposed spec changes serve to get a good idea of the scope
 of the changes, but will need some polishing.
 
  Comments?
 
  Thanks,
  Norbert
 
  [1]
 https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
 
  ___
  es-discuss mailing list
  es-discuss@mozilla.org
  https://mail.mozilla.org/listinfo/es-discuss
 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: New full Unicode for ES6 idea

2012-02-19 Thread Mark Davis
First, it would be great to get full Unicode support in JS. I know that's
been a problem for us at Google.

Secondly, while I agree with Addison that the approach that Java took is
workable, it does cause problems. Ideally someone would be able to loop (a
very common construct) with:

for (codepoint cp : someString) {
  doSomethingWith(cp);
}

In Java, you have to do:

int cp;
for (int i = 0; i < someString.length(); i += Character.charCount(cp)) {
  cp = someString.codePointAt(i);
  doSomethingWith(cp);
}

There are good reasons for why Java did what it did, basically for
compatibility. But if there is some way that JS can work around those,
that'd be great.

3. There's some confusion about the Unicode terminology. Here's a quick
clarification:

code point: number from 0 to 0x10FFFF

character: a code point that is assigned. Eg, 0x61 represents 'a' and is a
character. 0x378 is a code point, but not (yet) a character.

code unit: an encoding 'chunk'.
UTF-8 represents a code point as 1-4 8-bit code units
UTF-16 represents a code point as 1 or 2 16-bit code units
UTF-32 represents a code point as 1 32-bit code unit.
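
A quick illustration of the difference in ES terms, using U+1F600 (a code point
above U+FFFF):

var s = "\uD83D\uDE00";   // one code point, encoded as two UTF-16 code units
s.length;                 // 2 - string length counts code units
s.charCodeAt(0);          // 0xD83D, the lead surrogate code unit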

--
Mark https://plus.google.com/114199149796022210033
*
*
*— Il meglio è l’inimico del bene —*
**



On Sun, Feb 19, 2012 at 16:00, Cameron McCormack c...@mcc.id.au wrote:

 Brendan Eich:

  To hope to make this sideshow beneficial to all the cc: list, what do
  DOM specs use to talk about uint16 units vs. code points?

 I say code unit as a shorter way of saying 16 bit unsigned integer code
 unit

  
  http://dev.w3.org/2006/webapi/WebIDL/#dfn-code-unit

 (which DOM4 also links to) and then just code point to refer to 21 bit
 numbers that might correspond to a Unicode character, which you can see
 used in

  
  http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Question about the “full Unicode in strings” strawman

2012-01-25 Thread Mark Davis
You can't use \u10FFFF as syntax, because that could be \u10FF followed by
literal FF. A better syntax is \u{...}, with 1 to 6 digits, values from 0
.. 10FFFF.
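
A short sketch of the difference (the \u{...} form is only proposed here):

var s = "\uD83D\uDE00";           // U+1F600 written with today's UTF-16 escapes
// proposed: var s = "\u{1F600}"; // same string, 1 to 6 hex digits
// proposed: "\u{10FFFF}"         // unambiguous, unlike \u10FFFF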

Mark
*— Il meglio è l’inimico del bene —*
*
*
*
[https://plus.google.com/114199149796022210033]
*



On Wed, Jan 25, 2012 at 10:59, Gillam, Richard gil...@lab126.com wrote:

   The current 16-bit character strings are sometimes used to store
  non-Unicode binary data and can be used with non-Unicode character encoding
  with up to 16-bit chars.  21 bits is sufficient for Unicode but perhaps is
  not enough for other useful encodings. 32-bit seems like a plausible unit.

 How would an eight-digit \u escape sequence work from an implementation
 standpoint?  I'm assuming most implementations right now use 16-bit
 unsigned values as the individual elements of a String.  If we allow
 arbitrary 32-bit values to be placed into a String, how would you make that
 work?  There seem to only be a few options:

 a) Change the implementation to use 32-bit units.

 b) Change the implementation to use either 32-bit units as needed, with
 some sort of internal flag that specifies the unit size for an individual
 string.

 c) Encode the 32-bit values somehow as a sequence of 16-bit values.

 If you want to allow full generality, it seems like you'd be stuck with
 option a or option b.  Is there really enough value in doing this?

 If, on the other hand, the idea is just to make it easier to include
 non-BMP Unicode characters in strings, you can accomplish this by making a
 long \u sequence just be shorthand for the equivalent sequence in UTF-16:
   \u0010FFFF would be exactly equivalent to \udbff\udfff.  You don't have to
 change the internal format of the string, the indexes of individual
 characters stay the same, etc.

 --Rich Gillam
  Lab126

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Question about the “full Unicode in strings” strawman

2012-01-25 Thread Mark Davis
(oh, and I agree with your other points)

Mark
*— Il meglio è l’inimico del bene —*
*
*
*
[https://plus.google.com/114199149796022210033]
*



On Wed, Jan 25, 2012 at 11:11, Mark Davis ☕ m...@macchiato.com wrote:

  You can't use \u10FFFF as syntax, because that could be \u10FF followed by
  literal FF. A better syntax is \u{...}, with 1 to 6 digits, values from 0
  .. 10FFFF.

 Mark
 *— Il meglio è l’inimico del bene —*
 *
 *
 *
 [https://plus.google.com/114199149796022210033]
 *



 On Wed, Jan 25, 2012 at 10:59, Gillam, Richard gil...@lab126.com wrote:

   The current 16-bit character strings are sometimes used to store
  non-Unicode binary data and can be used with non-Unicode character encoding
  with up to 16-bit chars.  21 bits is sufficient for Unicode but perhaps is
  not enough for other useful encodings. 32-bit seems like a plausible unit.

 How would an eight-digit \u escape sequence work from an implementation
 standpoint?  I'm assuming most implementations right now use 16-bit
 unsigned values as the individual elements of a String.  If we allow
 arbitrary 32-bit values to be placed into a String, how would you make that
 work?  There seem to only be a few options:

 a) Change the implementation to use 32-bit units.

 b) Change the implementation to use either 32-bit units as needed, with
 some sort of internal flag that specifies the unit size for an individual
 string.

 c) Encode the 32-bit values somehow as a sequence of 16-bit values.

 If you want to allow full generality, it seems like you'd be stuck with
 option a or option b.  Is there really enough value in doing this?

 If, on the other hand, the idea is just to make it easier to include
 non-BMP Unicode characters in strings, you can accomplish this by making a
 long \u sequence just be shorthand for the equivalent sequence in UTF-16:
   \u0010FFFF would be exactly equivalent to \udbff\udfff.  You don't have to
 change the internal format of the string, the indexes of individual
 characters stay the same, etc.

 --Rich Gillam
  Lab126

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss



___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Globalization API: supportedLocalesOf vs. getSupportedLocales

2011-11-28 Thread Mark Davis
Here's the problem.

The very same collator for de is valid for de-DE, de-AT, and de-CH.
In ICU you actually get a functionally-equivalent object back, no matter
which of these you ask for.

However, that collator is *also* valid for other countries where 'de' is
official: de-LU, de-BE, de-LI. Moreover, it is *also* valid for countries
with sizable German speaking populations, including de-US and de-BR. And,
it is *also* valid for German as used in any other country: de-FR, ...,
even de-AQ.

That is, you would not expect any variation in collators between de-DE
and de-US. A German collator is equally valid for both. It is somewhat
arbitrary where any given implementation draws the line in terms of
indicating which locales a valid collator can be returned for.
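
In API terms (using the later Intl.Collator naming), requests for any of these
tags should end up with the same German collation behavior; which tag
resolvedOptions() reports is the implementation's choice. A hedged sketch:

var a = new Intl.Collator("de-DE");
var b = new Intl.Collator("de-US");
a.compare("\u00E4", "z") === b.compare("\u00E4", "z");   // true - same rules
// a.resolvedOptions().locale may be "de-DE" or just "de", depending on how far
// the implementation's supported-locale list goes.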

Mark
*— Il meglio è l’inimico del bene —*
*
*
*
[https://plus.google.com/114199149796022210033]
*



On Tue, Nov 29, 2011 at 02:15, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 The set of locales returned by a getSupportedLocales method would have to
 reflect what's actually supported by a Collator, NumberFormat, or
 DateTimeFormat implementation, so I doubt we'd get to the millions. Many of
 these 6000+ languages are spoken by fewer than 200 people, so certainly not
 in 200+ regions. And even where languages are spoken in many countries,
 there may not be defined regional variants: For example, I speak German and
 live in the U.S., but I don't know of any defined de-US collation, number
 format, or date format (in a German context, I'd use de-DE).

 If we let the application pass in the languages that it's interested in,
 that would probably be based on what a user has requested, so rarely more
 than 10 languages. If English, French, Spanish, and Arabic are on the list,
 you might still get over 100 locales, but that's about it.

 Norbert


 On Nov 28, 2011, at 17:37 , Shawn Steele wrote:

  There are 6000+ languages, and presumably any of them could be spoken in
 200+ regions.  There are additionally many variations of some of these
 languages.  So that's not a thousand locales, that's over a million
 locales.  Additionally there may be legitimate tags an application can
 support that it may have been originally designed for.  (Perhaps a new
 language or region or variant)  For an application that doesn't care much
 about the input locale, that's a lot of room for variety.
 
  For applications that are only localized to a certain number of
 languages, then perhaps a getSupportedLocalizations() would be manageable.
  Again though, that scope is narrow and may be inappropriate to use in
 other contexts.  Eg: my app is localized to only English, but someone
 uploaded French content, does that count?
 
  -Shawn
 
  -Original Message-
  From: es-discuss-boun...@mozilla.org [mailto:
 es-discuss-boun...@mozilla.org] On Behalf Of Norbert Lindenberg
  Sent: Monday, November 28, 2011 5:30 PM
  To: Eric Albright; Peter Constable; Shawn Steele
  Cc: es-discuss
  Subject: Re: Globalization API: supportedLocalesOf vs.
 getSupportedLocales
 
  We invented the supportedLocalesOf method to let applications find out
 which of its requested locales are supported by the implementation of a
 service. A getSupportedLocales function that simply returns the locales
 that the implementation actually supports would be easier to understand,
 and could also be used by an application to implement its own locale
 negotiation. If I remember correctly, we chose not to offer
 getSupportedLocales primarily because the list returned might be huge -
 possibly over 1000 locales.
 
  Maybe we should reconsider this? If an application really wants to have
 a list of 1000 locales, why not let it have it? If we want the ability to
 restrict the list, maybe there can be locale list as a parameter, and we
 return only those supported locales for which a prefix is on the locale
 list passed in? Or is there a more fundamental issue with
 getSupportedLocales?
 
  Thanks,
  Norbert
 
 
  On Nov 21, 2011, at 11:12 , Nicholas C. Zakas wrote:
 
  2. supportedLocalesOf
 
  I find this method name strange - I've read it several times and am
 still not sure I fully understand what it does. Perhaps
 getSupportedLocales() is a better name for this method? (I always prefer
 methods begin with verbs.)
 
  ___
  es-discuss mailing list
  es-discuss@mozilla.org
  https://mail.mozilla.org/listinfo/es-discuss
 
 

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Regex

2011-11-17 Thread Mark Davis
Regex has not been part of scope of the Globalization API work. I wanted to
find out whether any improvements from an internationalization point of
view are being planned, separately.

Some of the problems include:

   - Regexes fail on supplementary characters (above U+FFFF); see the sketch
   after this list. Most of these
   are rather low frequency, but there are a large number of Chinese
   characters, some used in people's names or place names.
  - This also impacts the result of validation in HTML5, such as in
  http://dev.w3.org/html5/spec/Overview.html#the-pattern-attribute
   - The Unicode support is otherwise extremely limited, especially for
   properties. See http://98.245.80.27/tcpc/OSCON2011/gbu.html for a
   comparison to other programming languages. The downside of this is that it
   promotes hard-coded lists because people think they know what characters
   occur in words, etc., but get it wrong.
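
As a sketch of the first problem, with today's code-unit-based regular expressions:

var han = "\uD842\uDFB7";   // U+20BB7, a supplementary CJK ideograph
/^.$/.test(han);            // false: "." matches a single code unit
/^..$/.test(han);           // true: one character looks like two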
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: i18n meeting mid August @ Google

2011-08-01 Thread Mark Davis
Works for me. (I would need to be out from 11:00-12:30.)

Mark
*— Il meglio è l’inimico del bene —*


On Mon, Aug 1, 2011 at 09:29, Nebojša Ćirić c...@google.com wrote:

 So far we have Monday and Tuesday off the table, and some people hinting
 that Wednesday would work best for them. Anybody has a conflict with 
 *Wednesday,
 August 17th*? I'll schedule a meeting by EOD tomorrow if I don't hear
 otherwise.

 Regards,
  Nebojša Ćirić

  On 29 July 2011 at 11:27, Nebojša Ćirić c...@google.com wrote:

 Hi all,
  some topics were left unreviewed at the last face-to-face meeting. I
 would like to organize another F2F meeting/teleconference at Google campus
 mid August to finish up work on the first draft. Would week of *Aug 15th
 - 20th* work for you? We could meet on *Tuesday, from 10 - 17h*.

 --
 Nebojša Ćirić




 --
 Nebojša Ćirić

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Fwd: Slide show: Survey of current programming language support for Unicode

2011-08-01 Thread Mark Davis
FYI

About the new BCP47 support in Java:

http://download.oracle.com/javase/tutorial/i18n/locale/extensions.html


The following compares Unicode support in programming languages, including
ES.


-- Forwarded message --
From: Karl Williamson pub...@khwilliamson.com
Date: Sat, Jul 30, 2011 at 13:01
Subject: Slide show: Survey of current programming language support for
Unicode
To: unic...@unicode.org unic...@unicode.org


Tom Christiansen recently gave a talk at the OSCON conference concerning the
varying levels of support for Unicode in some current programming languages.
 It is accessible via this link

http://training.perl.com/OSCON2011/index.html

The talk is entitled Unicode Support Shootout, and is is one of three
talks listed there, the other two being Perl specific.  I recommend the html
slide show version, from this link,

http://98.245.80.27/tcpc/OSCON2011/gbu.html

Left-click to advance to the next slide in the sequence.
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Comments on internationalization API

2011-07-22 Thread Mark Davis
You make some good points (and many that I agree with), but the main issue
is that we are having to produce a model that all the browser vendors can
sign up to. That necessitates some compromises, including some areas where
we can't have a concrete specification because the implementors want the
freedom to implement the functionality in different ways.

If you want to engage more, there is a F2F next week. Cira can get you
details.

Mark
*— Il meglio è l’inimico del bene —*


On Thu, Jul 21, 2011 at 17:14, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 Hi Mark,

 Thanks for your comments! Replies to some of them below. I also noticed
 some additional issues:

 19. DateTimeFormat.prototype.getMonths needs a second parameter {boolean}
 standalone, default value false.

 20. There needs to be a way to determine the actual language, region, and
 options of a Collator, NumberFormat, or DateTimeFormat. E.g., if I request
 ar-MA-u-ca-islamic, did I get exactly what I requested, or
 ar-MA-u-ca-islamicc, ar-MA-u-ca-gregory, ar-u-ca-gregory, or yet something
 else?

 Best regards,
 Norbert


 On Jul 20, 2011, at 9:46 , Mark Davis ☕ wrote:

  I have comments on some of these.
 
  Mark
  — Il meglio è l’inimico del bene —
 
 
  On Tue, Jul 19, 2011 at 01:29, Norbert Lindenberg 
 ecmascr...@norbertlindenberg.com wrote:
  Hi all,
 
  I'm sorry for not having been able to contribute to the
 internationalization API earlier. I finally have reviewed the straw man [1],
 and am pleased to see that it contains a good subset of internationalization
 functionality to start with. Number and date formatting and collation are
 issues that most applications have to deal with. Collation especially, but
 also date formatting with support for multiple time zones and calendars are
 hard to implement as downloadable libraries.
 
  I have some comments on the details though:
 
  1. In the background section, it might be useful to add that with
 Node.js server-side JavaScript is seeing a rebound, and applications don't
 really want to have to call out to a non-JavaScript server in order to
 handle basic internationalization.
 
  2. In the goals section, I'd qualify the reuse of objects goal as a
 reuse of implementation data structures, or even better replace it with
 measurable performance goals. Reuse of objects that are visible to
 applications has security and privacy implications, especially when loading
 third party code (apps or ads) onto pages [2]. I'd recommend letting
 applications freely construct Collator, NumberFormat, and DateTimeFormat
 objects, but have these objects share implementation objects (such as ICU
 objects) as much as possible. If the API does return shared objects, the
 security issues need to be dealt with, e.g., by specifying that the shared
 objects are immutable.
 
  I think it is reasonable to rephrase this as implementation data
 structures.
 
  3. I'm very uncomfortable with the LocaleInfo class. It seems to pretend
 being the central source of all locale-related information, but can't live
 up to that claim because its design is limited to number and date formatting
 and collation. Developers will need to create other functionality such as
 text segmentation, spelling checking, message lookup, shoe size conversion,
 etc. LocaleInfo appears to perform some magic to derive regions, currencies,
 and possibly time zones, but doesn't specify it, and makes none of it
 available to other internationalization classes. It also does duty as a
 namespace, which looks odd in an EcmaScript standard that otherwise doesn't
 know namespaces.
 
  I don't think it is ideal; I share some of your qualms about it. However,
 it is what we were able to compromise on. Because the LocaleInfo class does
 do the resolution, and that information is available after creation, the
 information is available for other services. And people could (being ES)
 hang services off of their own LocaleInfo class.

 So is this the current recommendation?: A library that provides word break
 and line break functionality should be based on a class MyLocaleInfo, which
 provides WordBreak and LineBreak classes whose constructors clients should
 not call, and wordBreak and lineBreak functions that return objects of these
 classes. An application that uses multiple such libraries (providing
 different sets of internationalized functionality) has to create objects of
 all their LocaleInfo classes so that it can request objects of the classes
 that it actually needs.

 What value do these LocaleInfo classes add, compared to having constructors
 of the actually needed classes that can be called directly?

 Also, the LocaleInfo API, as currently documented, doesn't provide any
 information that a third party internationalization library could use. Some
 comments sound like there should be a property options, but this property
 and the derivation of its values aren't actually documented.

  Other internationalization libraries have a core

Re: Comments on internationalization API

2011-07-20 Thread Mark Davis
I have comments on some of these.

Mark
*— Il meglio è l’inimico del bene —*


On Tue, Jul 19, 2011 at 01:29, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 Hi all,

 I'm sorry for not having been able to contribute to the
 internationalization API earlier. I finally have reviewed the straw man [1],
 and am pleased to see that it contains a good subset of internationalization
 functionality to start with. Number and date formatting and collation are
 issues that most applications have to deal with. Collation especially, but
 also date formatting with support for multiple time zones and calendars are
 hard to implement as downloadable libraries.

 I have some comments on the details though:

 1. In the background section, it might be useful to add that with Node.js
 server-side JavaScript is seeing a rebound, and applications don't really
 want to have to call out to a non-JavaScript server in order to handle basic
 internationalization.

 2. In the goals section, I'd qualify the reuse of objects goal as a reuse
 of implementation data structures, or even better replace it with measurable
 performance goals. Reuse of objects that are visible to applications has
 security and privacy implications, especially when loading third party code
 (apps or ads) onto pages [2]. I'd recommend letting applications freely
 construct Collator, NumberFormat, and DateTimeFormat objects, but have these
 objects share implementation objects (such as ICU objects) as much as
 possible. If the API does return shared objects, the security issues need to
 be dealt with, e.g., by specifying that the shared objects are immutable.


I think it is reasonable to rephrase this as implementation data
structures.


 3. I'm very uncomfortable with the LocaleInfo class. It seems to pretend
 being the central source of all locale-related information, but can't live
 up to that claim because its design is limited to number and date formatting
 and collation. Developers will need to create other functionality such as
 text segmentation, spelling checking, message lookup, shoe size conversion,
 etc. LocaleInfo appears to perform some magic to derive regions, currencies,
 and possibly time zones, but doesn't specify it, and makes none of it
 available to other internationalization classes. It also does duty as a
 namespace, which looks odd in an EcmaScript standard that otherwise doesn't
 know namespaces.


I don't think it is ideal; I share some of your qualms about it. However, it
is what we were able to compromise on. Because the LocaleInfo class does do
the resolution, and that information is available after creation, the
information is available for other services. And people could (being ES)
hang services off of their own LocaleInfo class.



 Other internationalization libraries have a core that anybody can build on
 to create internationalization functionality. In Java, for example, the
 Locale and Currency classes handles a variety of identifier mappings, while
 the ResourceBundle class handles loading of localized data with fallbacks
 [3]. In the Yahoo User Interface library, the Intl module does language
 negotiation and collaborates with the YUI loader in loading localized data
 [4]. I'd suggest separating similar functionality in LocaleInfo from the
 formatting and collation functionality and making it available to all. I
 suspect though that some of the current magic will turn out to be misguided
 when looked at in the clear light of a specification and will need to be
 discarded.

 4. Language IDs in the library should be those of BCP 47, not of Unicode
 LDML. The two are similar, but there are subtle differences, as described in
 the LDML spec: LDML excludes some BCP 47 tags and subtags, adds a separator
 and the root locale, and changes the semantics of some tags [5]. Since BCP
 47 is the dominant standard for language identification, internationalized
 applications have to support it. If an implementation of the
 internationalization API is based on LDML, it should handle the mapping
 from/to BCP 47 itself rather than burdening applications with it.


Every LDML language ID is also a BCP 47 language tag. LDML eliminates some
of the deadwood in BCP47 (the old irregular forms) but has the same
expressive power and somewhat more. There are some codes that are not
defined in BCP47 that turn out to be very important for implementations,
like the Unknown region.

I'm well familiar with both, being an author of each.


 5. The specification mentions that a few Unicode extensions in BCP 47
 (-u-ca-, -u-co-, can be used for specific purposes, but is silent on whether
 other extension are encouraged/allowed/ignored/illegal. This should be
 clarified.


Agreed. What it should add is one line saying that the implementation of any
other BCP47 extensions is implementation dependent.



 6. Region IDs should be those of ISO 3166. The straw man references LDML
 region subtags instead; I haven't been able to find a definition 

Fwd: Full Unicode strings strawman

2011-05-19 Thread Mark Davis
Markus isn't on es-discuss, so forwarding

-- Forwarded message --
From: Markus Scherer markus@gmail.com
Date: Wed, May 18, 2011 at 22:18
Subject: Re: Full Unicode strings strawman
To: Allen Wirfs-Brock al...@wirfs-brock.com
Cc: Shawn Steele shawn.ste...@microsoft.com, Mark Davis ☕ 
m...@macchiato.com, es-discuss@mozilla.org es-discuss@mozilla.org


On Mon, May 16, 2011 at 5:07 PM, Allen Wirfs-Brock al...@wirfs-brock.comwrote:

 I agree that application writers will continue, for the foreseeable future, to
 have to know whether or not they are dealing with UTF-16 encoded data and/or
 communicating with other subsystems that expect such data.  However, core
 language support for UTF-32 is a prerequisite for ever moving beyond
 UTF-16 APIs and libraries and getting back to uniform-sized character
 processing.


This seems to be based on a misunderstanding. Fixed-width encodings are nice
but not required. The majority of Unicode-aware code uses either UTF-8 or
UTF-16, and supports the full Unicode code point range without too much
trouble. Even with UTF-32 you get user characters that require sequences
of two or more code points (e.g., base character + diacritic, Han character
+ variation selector) and there is not always a composite character for such
a sequence.
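
To make that point concrete, a tiny ES-flavored illustration (the values are
shown only in the comments):

// One "user character" can span several code points even with per-code-point
// indexing: 'e' followed by a combining acute accent renders as one character.
var decomposed = "e\u0301";    // 2 code points, displays as é
var precomposed = "\u00E9";    // 1 code point, same display
// decomposed.length is 2 and precomposed.length is 1 under any
// per-code-point (or BMP code unit) indexing.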

Windows NT uses 16-bit Unicode, started BMP-only and has supported the full
Unicode range since Windows 2000.
MacOS X uses 16-bit Unicode (coming from NeXT) and supports the full Unicode
range. (Ever since MacOS X 10.0 I believe.) Lower-level MacOS APIs use UTF-8
char* and support the full Unicode range.
ICU uses 16-bit Unicode, started BMP-only and has supported the full range
in most services since the year 2000.
Java uses 16-bit Unicode, started BMP-only and has supported the full range
since Java 5.
KDE uses 16-bit Unicode, started BMP-only and has supported the full range
for years.
Gnome uses UTF-8 and supports the full range.

JavaScript uses 16-bit Unicode, is still BMP-only although most
implementations input and render the full range, and updating its spec and
implementations to upgrade compatibly like everyone else seems like the best
option.

In a programming language like JavaScript that is heavy on string
processing, and interfaces with the UTF-16 DOM and UTF-16 client OSes, a
UTF-32 string model might be more trouble than it's worth (and possibly a
performance hit).

FYI: I proposed full-Unicode support in JavaScript in 2003, a few months
before the committee became practically defunct for a while.
https://sites.google.com/site/markusicu/unicode/es/unicode-2003
https://sites.google.com/site/markusicu/unicode/es/i18n-2003

Best regards,
markus
(Google/ICU/Unicode)
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Full Unicode strings strawman

2011-05-18 Thread Mark Davis
On Tue, May 17, 2011 at 20:01, Wes Garland w...@page.ca wrote:

 Mark;

 Are you Dr. *Mark E. Davis* (born September 13, 1952 (age 58)), co-founder
 of the Unicode http://en.wikipedia.org/wiki/Unicode project and the
 president of the Unicode Consortium
 http://en.wikipedia.org/wiki/Unicode_Consortium since its
 incorporation in 1991?


Guilty as charged.




 (If so, uh, thanks for giving me alternatives to Shift-JIS, GB-2312, Big-5,
 et al..those gave me lots of hair loss in the late 90s)


You're welcome. We did it to save ourselves from the hair-pulling we had in
the *80's* over those charsets ;-)



 On 17 May 2011 21:55, Mark Davis ☕ m...@macchiato.com wrote:In the past,
 I have read it thus, pseudo BNF:


 UnicodeString = CodeUnitSequence // D80
 CodeUnitSequence = CodeUnit | CodeUnitSequence CodeUnit // D78
 CodeUnit = anything in the current encoding form // D77


 So far, so good. In particular, d800 is a code unit for UTF-16, since it
 is a code unit that can occur in some code unit sequence in UTF-16.


 *head smack* - code unit, not code point.




 This means that your original assertion -- that Unicode strings cannot
 contain the high surrogate code points, regardless of meaning -- is in fact
 correct.


  That is incorrect.


 Aie, Karumba!

 If we have

- a sequence of code points
- taking on values between 0 and 0x1F

 10


- including high surrogates and other reserved values
- independent of encoding

 ..what exactly are we talking about?  Can it be represented in UTF-16
 without round-trip loss when normalization is not performed, for the code
 points 0 through 0xFFFF?


Surrogate code points (U+D800..U+DFFF) can't be represented in any
*UTF* string. They can, however, be represented in
*Unicode strings* (ones that are not valid UTF strings), with the one
restriction that in UTF-16 they have to be isolated. In practice, we just
don't find that isolated surrogates in Unicode 16-bit strings cause a
problem, so I think that issue has derailed the more important issues
involved in this discussion, which are in the API.


 Incidentally, I think this discussion underscores nicely why I think we
 should work hard to figure out a way to hide UTF-16 encoding details from
 user-end programmers.



The biggest issue is the indexing. In Java, for example, iterating through a
string has some ugly syntax:

int cp;
for (int i = 0; i < string.length(); i += Character.charCount(cp)) {
    cp = string.codePointAt(i);
    doSomethingWith(cp);
}

But it doesn't have to be that way; they could have supported, with a little
bit of semantic sugar, something like:

for (int cp : aString) {
  doSomethingWith(cp);
}

If done well, the complexity doesn't have to show to the user. In many
cases, as Shawn pointed out, codepoints are not really the right unit
anyway. What the user may actually need are word boundaries, or grapheme
cluster boundaries, etc. If, for example, you truncate a string on just code
point boundaries, you'll get the wrong answer sometimes.

It is of course simpler, if you are either designing a programming language
from scratch *or* are able to take the compatibility hit, to have the API
for strings always index by code points. That is, from the outside, a string
always looks like it is a simple sequence of code points. There are a couple
of ways to do that efficiently, even where the internal storage is not 32
bit chunks.
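
As a rough sketch of how that complexity can be hidden even over today's
16-bit ES strings (the helper name is illustrative, not a proposed API):

function forEachCodePoint(s, doSomethingWith) {
  for (var i = 0; i < s.length; i++) {
    var cp = s.charCodeAt(i);
    // Combine a high surrogate with an immediately following low surrogate.
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.length) {
      var low = s.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        i++; // skip the low surrogate we just consumed
      }
    }
    doSomethingWith(cp);
  }
}

The caller then sees only code points, much like the Java for-loop sugar above.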


 Wes

 --
 Wesley W. Garland
 Director, Product Development
 PageMail, Inc.
 +1 613 542 2787 x 102

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Full Unicode strings strawman

2011-05-18 Thread Mark Davis
Yes, one of the options for the internal storage of the string class is to
use different arrays depending on the contents.

   1. uint8's if all the codepoint values are <= 0xFF
   2. uint16's if all the codepoint values are <= 0xFFFF
   3. uint32's otherwise

That way the internal storage always corresponds directly to the code point
index, which makes random access fast. Case #3 occurs rarely, so it is ok if
it takes more storage in that case.
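
A rough sketch of that storage decision (assuming typed arrays are available
to the implementation; this is not part of any proposal):

function makeCodePointStorage(codePoints) {
  var max = 0;
  for (var i = 0; i < codePoints.length; i++) {
    if (codePoints[i] > max) max = codePoints[i];
  }
  // Pick the narrowest element type that holds every code point, so that the
  // storage index equals the code point index and random access stays cheap.
  if (max <= 0xFF) return new Uint8Array(codePoints);
  if (max <= 0xFFFF) return new Uint16Array(codePoints);
  return new Uint32Array(codePoints);
}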

Mark

*— Il meglio è l’inimico del bene —*


On Wed, May 18, 2011 at 14:46, Erik Corry erik.co...@gmail.com wrote:

 2011/5/17 Wes Garland w...@page.ca:
  If you're already storing UTF-8 strings internally, then you are already
  doing something expensive (like copying) to get their code units into
 and
  out of JS; so no incremental perf impact by not having a common UTF-16
  backing store.
 
 
  (As a note, Gecko and WebKit both use UTF-16 internally; I would be
  _really_ surprised if Trident does not.  No idea about Presto.)
 
  FWIW - last I time I scanned the v8 sources, it appeared to use a
  three-representation class, which could store either ASCII, UCS2, or
 UTF-8.
  Presumably ASCII could also be ISO-Latin-1, as both are exact, naive,
  byte-sized UCS2/UTF-16 subsets.

 V8 has ASCII strings and UCS2 strings.  There are no Latin1 strings
 and UTF-8 is only used for IO, never for internal representation.
 WebKit uses UCS2 throughout and V8 is able to work directly on WebKit
 UCS2 strings that are on WebKit's C++ heap.

 I like Shawn Steele's suggestion.

 --
 Erik Corry
 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Full Unicode strings strawman

2011-05-17 Thread Mark Davis
The wrong conclusion is being drawn. I can say definitively that for the
string a\uD800b.

   - It is a valid Unicode string, according to the Unicode Standard.
   - It cannot be encoded as well-formed in any UTF-x (it is not
   'well-formed' in any UTF).
   - When it comes to conversion, the bad code unit \uD800 needs to be
   handled (eg converted to FFFD, escaped, etc.)

Any programming language using Unicode has the choice of either

   1. allowing strings to be general Unicode strings, or
   2. guaranteeing that they are always well-formed.

There are trade-offs either way, but both are feasible.

Mark

*— Il meglio è l’inimico del bene —*


On Tue, May 17, 2011 at 13:03, Boris Zbarsky bzbar...@mit.edu wrote:

 On 5/17/11 3:29 PM, Wes Garland wrote:

 But the point remains, the FAQ entry you quote talks about encoding a
 lone surrogate, i.e. a code unit, which is not a complete code point.
 You can only convert complete code points from one encoding to another.
 Just like you can't represent part of a UTF-8 code sub-sequence in any
 other encoding. The fact that code point X is not representable in
 UTF-16 has no bearing on its status as a code point, nor its
 convertability to UTF-8.  The problem is that UTF-16 cannot represent
 all possible code points.


 My point is that neither can UTF-8.  Can you name an encoding that _can_
 represent the surrogate-range codepoints?


   From page 90 of the Unicode 6.0 specification, in the Conformance
 chapter:

/D80 Unicode string:/ A code unit sequence containing code units of
a particular Unicode
encoding form.
• In the rawest form, Unicode strings may be implemented simply as
arrays of
the appropriate integral data type, consisting of a sequence of code
units lined
up one immediately after the other.
• A single Unicode string must contain only code units from a single
Unicode
encoding form. It is not permissible to mix forms within a string.



Not sure what (D80) is supposed to mean.


 Sorry, (D80) means per definition D80 of The Unicode Standard,
 Version 6.0


 Ah, ok.  So the problem there is that this is definition only makes sense
 when a particular Unicode encoding form has been chosen.  Which Unicode
 encoding form have we chosen here?

 But note also that D76 in that same document says:

  Unicode scalar value: Any Unicode code point except high-surrogate
and low-surrogate code points.

 and D79 says:

  A Unicode encoding form assigns each Unicode scalar value to a unique
  code unit sequence.

 and

  To ensure that the mapping for a Unicode encoding form is
  one-to-one, all Unicode scalar values, including those
  corresponding to noncharacter code points and unassigned code
  points, must be mapped to unique code unit sequences. Note that
  this requirement does not extend to high-surrogate and
  low-surrogate code points, which are excluded by definition from
  the set of Unicode scalar values.

 In particular, this makes it clear (to me, at least) that whatever Unicode
 encoding form you choose, a Unicode string can only consist of code units
 encoding Unicode scalar values, which does NOT include high and low
 surrogates.

 Therefore I stand by my statement: if you allow what to me looks like
 arrays UTF-32 code units and also values that fall into the surrogate
 ranges then you don't get Unicode strings.  You get a set of arrays that
 contains Unicode strings as a proper subset.


 OK, that seems like a breaking change.

 Yes, I believe it would be, certainly if done naively, but I am hopeful
 somebody can figure out how to overcome this.


 As long as we worry about that _before_ enshrining the result in a spec,
 I'm all of being hopeful.


 Maybe, and maybe not.  We (Mozilla) have had some proposals to
actually use UTF-8 throughout, including in the JS engine; it's
quite possible to implement an API that looks like a 16-bit array on
top of UTF-8 as long as you allow invalid UTF-8 that's needed to
represent surrogates and the like.


 I understand by this that in the Moz proposals, you mean that the
 invalid UTF-8 sequences are actually valid UTF-8 Strings which encode
 code points in the range 0xd800-0xdfff


 There are no such valid UTF-8 strings; see spec quotes above.  The proposal
 would have involved having invalid pseudo-UTF-ish strings.


  and that these code points were
 translated directly (and purposefully incorrectly) as UTF-16 code units
 when viewed as 16-bit arrays.


 Yep.


  If JS Strings were arrays of Unicode code points, this conversion would
 be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point
 0xdc08, with no incorrect conversion taking place.


 Sorry, no.  See above.


  The only problem is
 if there is an intermediate component somewhere that insists on using
 UTF-16..at that point we just can't represent code point 0xdc08 at all.


 I just don't get it.  You can stick the invalid 16-bit value 

Re: Full Unicode strings strawman

2011-05-17 Thread Mark Davis
That is incorrect. See below.

Mark

*— Il meglio è l’inimico del bene —*


On Tue, May 17, 2011 at 18:33, Wes Garland w...@page.ca wrote:

 On 17 May 2011 20:09, Boris Zbarsky bzbar...@mit.edu wrote:

 On 5/17/11 5:24 PM, Wes Garland wrote:

 Okay, I think we have to agree to disagree here. I believe my reading of
 the spec is correct.


 Sorry, but no...  how much more clear can the spec get?


 In the past, I have read it thus, pseudo BNF:

 UnicodeString = CodeUnitSequence // D80
 CodeUnitSequence = CodeUnit | CodeUnitSequence CodeUnit // D78
 CodeUnit = anything in the current encoding form // D77


So far, so good. In particular, d800 is a code unit for UTF-16, since it is
a code unit that can occur in some code unit sequence in UTF-16.



 Upon careful re-reading of this part of the specification, I see that D79
 is also important.  It says that A Unicode encoding form assigns each
 Unicode scalar value to a unique code unit sequence.,


True.


 and further clarifies that The mapping of the set of Unicode scalar values
 to the set of code unit sequences for a Unicode encoding form is
 one-to-one.


True.

This is all consistent with saying that UTF-16 can't contain an isolated
d800.

*However, that only shows that a Unicode 16-bit string (D82) is not the same
as a UTF-16 String (D89), which has been pointed out previously.*

Repeating the note under D89:


A Unicode string consisting of a well-formed UTF-16 code unit sequence is
said
to be *in UTF-16*. Such a Unicode string is referred to as a *valid UTF-16
string*,
or a *UTF-16 string* for short.

That is, every UTF-16 string is a Unicode 16-bit string, but *not* vice
versa.

Examples:

   - \u0061\ud800\udc00 is both a Unicode 16-bit string and a UTF-16
   string.
   - \u0061\ud800\u0062 is a Unicode 16-bit string, but not a UTF-16
   string.
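
A small sketch of the distinction drawn by the examples above (the helper name
is illustrative only): a check for whether a Unicode 16-bit string is also a
well-formed UTF-16 string, i.e. contains no unpaired surrogate:

function isWellFormedUTF16(s) {
  for (var i = 0; i < s.length; i++) {
    var u = s.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF) {            // high surrogate
      var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false; // unpaired high
      i++;                                       // skip the paired low
    } else if (u >= 0xDC00 && u <= 0xDFFF) {     // low surrogate with no high
      return false;
    }
  }
  return true;
}

// isWellFormedUTF16("\u0061\ud800\udc00") === true
// isWellFormedUTF16("\u0061\ud800\u0062") === false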



 This means that your original assertion -- that Unicode strings cannot
 contain the high surrogate code points, regardless of meaning -- is in fact
 correct.


That is incorrect.



 Which is unfortunate, as it means that we either

 1. Allow non-Unicode strings in JS -- i.e. Strings composed of all
 values in the set [0x0, 0x1FFFFF]
2. Keep making programmers pay the raw-UTF-16 representation tax
3. Break the String-as-uint16 pattern

 I still believe that #1 is the way forward, and that problem of
 round-tripping these values through the DOM is solvable.

 Wes

 --
 Wesley W. Garland
 Director, Product Development
 PageMail, Inc.
 +1 613 542 2787 x 102

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Full Unicode strings strawman

2011-05-16 Thread Mark Davis
I'm quite sympathetic to the goal, but the proposal does represent a
significant breaking change. The problem, as Shawn points out, is with
indexing. Before, the strings were defined as UTF16.

Take a sample string \ud800\udc00\u0061 = \u{10000}\u{61}. Right now,
the 'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a'
would be at offset 1. This will definitely cause breakage in existing code;
characters are in different positions than they were, even characters that
are not supplemental ones. All it takes is one supplemental character before
the current position and the offsets will be off for the rest of the string.
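
A minimal sketch of that breakage, contrasting today's UTF-16 code unit
indexing with the proposed code point indexing (the "proposed" values are
given only as comments, since no such implementation exists):

var s = "\uD800\uDC00\u0061";   // one supplementary character, then 'a'

// Today (code unit indexing):
//   s.length === 3, s.indexOf("\u0061") === 2
// Under the proposal (code point indexing):
//   s.length would be 2 and 'a' would be at index 1,
//   so any stored offsets into existing strings shift.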

Faced with exactly the same problem, Java took a different approach that
allows for handling of the full range of Unicode characters, but maintains
backwards compatibility. It may be instructive to look at what they did
(although there was definitely room for improvement in their approach!). I
can follow up with that if people are interested. Alternatively, perhaps
mechanisms can put in place to tell ECMAScript to use new vs old indexing
(Perl uses PRAGMAs for that kind of thing, for example), although that has
its own ugliness.

Mark

*— Il meglio è l’inimico del bene —*


On Mon, May 16, 2011 at 13:38, Wes Garland w...@page.ca wrote:

 Allen;

 Thanks for putting this together.  We use Unicode data extensively in both
 our web and server-side applications, and being forced to deal with UTF-16
 surrogate pair directly -- rather than letting the String implementation
 deal with them -- is a constant source of mild pain.  At first blush, this
 proposal looks like it meets all my needs, and my gut tells me the perf
 impacts will probably be neutral or good.

 Two great things about strings composed of Unicode code points:
 1) .length represents the number of code points, rather than the number of
 pairs used in UTF-16, even if the underlying representation isn't UTF-16
 2) S.charCodeAt(S.indexOf(X)) always returns the same kind of information
 (a Unicode code point), regardless of whether X is in the BMP or not

 Although this is a breaking change from ES-5, I support it
 whole-heartedly; I expect breakage to be very limited. Provided that
 the implementation does not restrict the storage of reserved code points
 (D800-DFFF), it should be possible for users using String as immutable
 C-arrays to keep doing so. Users doing surrogate pair decomposition will
 probably find that their code just works, as those code points will never
 appear in legitimate strings of Unicode code points.  Users creating Strings
 with surrogate pairs will need to re-tool, but this is a small burden and
 these users will be at the upper strata of Unicode-foodom.  I suspect that
 99.99% of users will find that this change will fix bugs in their code when
 dealing with non-BMP characters.

 Mike Samuel, there would never be a supplementary code unit to match, as the
 return value of [[Get]] would be a code point.

 Shawn Steele, I don't understand this comment:

 Also, the “trick” I think, is encoding to surrogate pairs (illegally, since
 UTF8 doesn’t allow that) vs decoding to UTF16.


 Why do we care about the UTF-16 representation of particular codepoints?
 Why can't the new functions just encode the Unicode string as UTF-8 and URI
 escape it?

 Mike Samuel, can you explain why you are en/decoding UTF-16 when
 round-tripping through the DOM?  Does the DOM specify UTF-16 encoding? If it
 does, that's silly.  Both ES and DOM should specify Unicode and let the
 data interchange format be an implementation detail.  It is an unfortunate
 accident of history that UTF-16 surrogate pairs leak their abstraction into
 ES Strings, and I believe it is high time we fixed that.

 Wes

 --
 Wesley W. Garland
 Director, Product Development
 PageMail, Inc.
 +1 613 542 2787 x 102

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Full Unicode strings strawman

2011-05-16 Thread Mark Davis
In terms of implementation capabilities, there isn't really a significant
practical difference between

   - a UCS-2 implementation, and
   - a UTF-16 implementation that doesn't have supplemental characters in
   its supported repertoire.


Mark

*— Il meglio è l’inimico del bene —*


On Mon, May 16, 2011 at 14:28, Shawn Steele shawn.ste...@microsoft.comwrote:

  I think the problem isn’t so much that the spec used UCS-2, but rather
 that some implementations used UTF-16 instead as that is more convenient in
 many cases.  To the application developer, it’s difficult to tell the
 difference between UCS-2 and UTF-16 if I can use a regular expression to
 find D800, DC00.  Indeed, when the rendering engine of whatever host is
 going to display the glyph for U+10000, it’d be hard to notice the subtlety
 of UCS-2 vs UTF-16.



 -Shawn



 *From:* es-discuss-boun...@mozilla.org [mailto:
 es-discuss-boun...@mozilla.org] *On Behalf Of *Jungshik Shin (???, ???)
 *Sent:* Monday, May 16, 2011 2:24 PM
 *To:* Mark Davis ☕
 *Cc:* Markus Scherer; es-discuss@mozilla.org

 *Subject:* Re: Full Unicode strings strawman





 On Mon, May 16, 2011 at 2:19 PM, Mark Davis ☕ m...@macchiato.com wrote:

 I'm quite sympathetic to the goal, but the proposal does represent a
 significant breaking change. The problem, as Shawn points out, is with
 indexing. Before, the strings were defined as UTF16.



 I agree with what Mark wrote except that the previous spec used UCS-2, which
 this proposal (and other proposals on the issue) try to rectify. I think
 that taking Java's approach would work better with DOMString as well.



 See W3C I18N WG's proposal
 http://www.w3.org/International/wiki/JavaScriptInternationalization
 on the issue and Java's approach
 http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
 (linked there)



 Jungshik





  Take a sample string \ud800\udc00\u0061 = \u{10000}\u{61}. Right now,
 the 'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a'
 would be at offset 1. This will definitely cause breakage in existing code;
 characters are in different positions than they were, even characters that
 are not supplemental ones. All it takes is one supplemental character before
 the current position and the offsets will be off for the rest of the string.



 Faced with exactly the same problem, Java took a different approach that
 allows for handling of the full range of Unicode characters, but maintains
 backwards compatibility. It may be instructive to look at what they did
 (although there was definitely room for improvement in their approach!). I
 can follow up with that if people are interested. Alternatively, perhaps
 mechanisms can put in place to tell ECMAScript to use new vs old indexing
 (Perl uses PRAGMAs for that kind of thing, for example), although that has
 its own ugliness.



 Mark

 *— Il meglio è l’inimico del bene —*

   On Mon, May 16, 2011 at 13:38, Wes Garland w...@page.ca wrote:

  Allen;

 Thanks for putting this together.  We use Unicode data extensively in both
 our web and server-side applications, and being forced to deal with UTF-16
 surrogate pair directly -- rather than letting the String implementation
 deal with them -- is a constant source of mild pain.  At first blush, this
 proposal looks like it meets all my needs, and my gut tells me the perf
 impacts will probably be neutral or good.

 Two great things about strings composed of Unicode code points:
 1) .length represents the number of code points, rather than the number of
 pairs used in UTF-16, even if the underlying representation isn't UTF-16
 2) S.charCodeAt(S.indexOf(X)) always returns the same kind of information
 (a Unicode code point), regardless of whether X is in the BMP or not

 Although this is a breaking change from ES-5, I support it
 whole-heartedly; I expect breakage to be very limited. Provided that
 the implementation does not restrict the storage of reserved code points
 (D800-DFFF), it should be possible for users using String as immutable
 C-arrays to keep doing so. Users doing surrogate pair decomposition will
 probably find that their code just works, as those code points will never
 appear in legitimate strings of Unicode code points.  Users creating Strings
 with surrogate pairs will need to re-tool, but this is a small burden and
 these users will be at the upper strata of Unicode-foodom.  I suspect that
 99.99% of users will find that this change will fix bugs in their code when
 dealing with non-BMP characters.

 Mike Samuel, there would never be a supplementary code unit to match, as the
 return value of [[Get]] would be a code point.

 Shawn Steele, I don't understand this comment:



 Also, the “trick” I think, is encoding to surrogate pairs (illegally, since
 UTF8 doesn’t allow that) vs decoding to UTF16.


 Why do we care about the UTF-16 representation of particular codepoints?
 Why can't the new functions just encode the Unicode string as UTF-8 and URI
 escape

Re: Full Unicode strings strawman

2011-05-16 Thread Mark Davis
A correction.

U+D800 is indeed a code point: http://www.unicode.org/glossary/#Code_Point. It
is defined for usage in Unicode Strings (see
http://www.unicode.org/glossary/#Unicode_String) because often it is useful
for implementations to be able to allow it in processing.

It does, however, have a special status, and is not representable in
well-formed UTF-xx, for general interchange.

A quick note on the intro to the doc, with a bit more history.

 ECMAScript currently only directly supports the 16-bit basic multilingual
plane (BMP) subset of Unicode which is all that existed when ECMAScript was
first designed. Since then Unicode has been extended to require up to
21-bits per code.

   1. Unicode was extended to go up to 10FFFF in version 2.0, in July of
   1996.
   2. ECMAScript, according to Wikipedia, was first issued in 1997. So
   actually for all of ECMAScript's existence, it has been obsolete in its
   usage of Unicode.
  - (It isn't quite as bad as that, since we made provision for
supplementary characters
  early-on, but the first *actual* supplementary characters appeared in
  2003.)
   3. In 2003, Markus Scherer proposed support for Unicode in ECMAScript v4:
  1. https://sites.google.com/site/markusicu/unicode/es/unicode-2003
  2. https://sites.google.com/site/markusicu/unicode/es/i18n-2003


Mark

*— Il meglio è l’inimico del bene —*


On Mon, May 16, 2011 at 14:42, Boris Zbarsky bzbar...@mit.edu wrote:

 On 5/16/11 4:38 PM, Wes Garland wrote:

 Two great things about strings composed of Unicode code points:

 ...

  Although this is a breaking change from ES-5, I support it
 whole-heartedly; I expect breakage to be very limited. Provided
 that the implementation does not restrict the storage of reserved code
 points (D800-DFFF)


 Those aren't code points at all.  They're just not Unicode.

 If you allow storage of such, then you're allowing mixing Unicode strings
 and something else (whatever the something else is), most likely with
 bad results.

 Most simply, assignign a DOMString containing surrogates to a JS string
 should collapse the surrogate pairs into the corresponding codepoint if JS
 strings really contain codepoints...

 The only way to make this work is if either DOMString is redefined or
 DOMString and full Unicode strings are different kinds of objects.


  Users doing surrogate pair decomposition will probably find that their
 code just works


 How, exactly?


  Users creating Strings with surrogate pairs will need to
 re-tool


 Such users would include the DOM, right?


  but this is a small burden and these users will be at the upper
 strata of Unicode-foodom.


 You're talking every single web developer here.  Or at least every single
 web developer who wants to work with Devanagari text.


  I suspect that 99.99% of users will find that
 this change will fix bugs in their code when dealing with non-BMP
 characters.


 Not unless DOMString is changed or the interaction between the two very
 carefully defined in failure-proof ways.


  Why do we care about the UTF-16 representation of particular
 codepoints?


 Because of DOMString's use of UTF-16, at least (forced on it by the fact
 that that's what ES used to do, but here we are).


  Mike Samuel, can you explain why you are en/decoding UTF-16 when
 round-tripping through the DOM?  Does the DOM specify UTF-16 encoding?


 Yes.


  If it does, that's silly.


 It needed to specify _something_, and UTF-16 was the thing that was
 compatible with how scripts work in ES.  Not to mention the Java legacy if
 the DOM...


  Both ES and DOM should specify Unicode and let the data interchange
 format be an implementation detail.


 That's fine if _both_ are changed.  Changing just one without the other
 would just cause problems.


  It is an unfortunate accident of history that UTF-16 surrogate pairs leak
 their
 abstraction into ES Strings, and I believe it is high time we fixed that.


 If you can do that without breaking web pages, great.  If not, then we need
 to talk.  ;)

 -Boris

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Full Unicode strings strawman

2011-05-16 Thread Mark Davis
In practice, the supplemental code points don't really cause problems in
Unicode strings. Most implementations just treat them as if they were
unassigned. The only important issue is that *when* they are converted to
UTF-xx for storage or transmission, they need to be handled; typically by
converting to FFFD (never just deleted - a bad idea for security).
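
A minimal sketch of that conversion step (the helper name is illustrative
only), replacing each unpaired surrogate with U+FFFD rather than deleting it:

function replaceUnpairedSurrogates(s) {
  var out = "";
  for (var i = 0; i < s.length; i++) {
    var u = s.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.length &&
        s.charCodeAt(i + 1) >= 0xDC00 && s.charCodeAt(i + 1) <= 0xDFFF) {
      out += s.charAt(i) + s.charAt(i + 1);   // well-formed pair: keep it
      i++;
    } else if (u >= 0xD800 && u <= 0xDFFF) {
      out += "\uFFFD";                        // unpaired surrogate: substitute
    } else {
      out += s.charAt(i);
    }
  }
  return out;
}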

Mark

*— Il meglio è l’inimico del bene —*


On Mon, May 16, 2011 at 14:46, Boris Zbarsky bzbar...@mit.edu wrote:

 On 5/16/11 5:16 PM, Mike Samuel wrote:

 The strawman says

 The String type is the set of all finite ordered sequences of zero or
 more 21-bit unsigned integer values (“elements”).


 Yeah, that's not the same thing as an actual Unicode string, and requires
 handling of all sorts of "what if someone sticks non-Unicode in there?"
 issues...

 Of course people actually do use JS strings as immutable arrays of 16-bit
 unsigned integers right now (not just as byte arrays), so I suspect that we
 can't easily exclude the surrogate ranges from strings without breaking
 existing content...


 -Boris
 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Collation API not complete for search

2011-03-28 Thread Mark Davis
Searching is discussed in UTS#10. It does need to be correlated with user's
expectations for matching, as you observe.

Mark

*— Il meglio è l’inimico del bene —*


On Mon, Mar 28, 2011 at 14:13, Shawn Steele shawn.ste...@microsoft.comwrote:

  Searching gets tricky.  Is the result greedy or not (matches as much as
 possible or as little as possible), etc.  There are lots of variations,
 which is why it was skipped from the initial v0.5.



 Comparison, Search and Casing are all dependent on each other.  If search
 finds a substring, we’d expect comparison to match that substring.
 Similarly, if one is using Turkish I, we expect all of them to do so.



 - Shawn



 *From:* Nebojša Ćirić [mailto:c...@google.com]
 *Sent:* Monday, March 28, 2011 1:36 PM
 *To:* Mark Davis ☕
 *Cc:* es-discuss@mozilla.org; Shawn Steele; Phillips, Addison
 *Subject:* Re: Collation API not complete for search



 Shawn, would you be ok with adding this new API to the list for 0.5 so we
 can support collation search?



 I'll edit the strawman in case nobody objects to this addition.

 On 25 March 2011 at 16:34, Nebojša Ćirić c...@google.com wrote:

 In that case I wouldn't put this new functionality in the Collator object.
 A new StringSearch or StringIterator object would make more sense:



 options = {

   collator[optional - default, collatorType=search],

   source[required],

   pattern[required]

 }

 LocaleInfo.StringIterator = function(options) {}

 LocaleInfo.StringIterator.prototype.first = function() { find
 first occurrence}

 LocaleInfo.StringIterator.prototype.next = function() { get me
 next occurrence of pattern in source}

 LocaleInfo.StringIterator.prototype.matchLength = function() { length of
 the match }

 ... (reset, setPosition...)
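
A hedged sketch of how the proposed iterator might be used; the names follow
the outline above, and the return conventions (an index, or -1 when there is
no further match) are assumptions, not a specified API:

var it = new LocaleInfo.StringIterator({
  collator: new LocaleInfo.Collator({collatorType: 'search'}),
  source: 'Søren Aagaard',
  pattern: 'gaard'
});
var matches = [];
for (var pos = it.first(); pos !== -1; pos = it.next()) {
  // matchLength() may differ from pattern.length, e.g. 'gård' matching
  // 'gaard', which is why [start, end] pairs are needed.
  matches.push([pos, pos + it.matchLength()]);
}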

  On 25 March 2011 at 15:14, Mark Davis ☕ m...@macchiato.com wrote:



 I think an iterator is a cleaner interface; we were just trying to minimize
 new API.



 In general, collation is context sensitive, so searching on substrings
 isn't a good idea. You want to search from a location, but have the rest of
 the text available to you.



 For the iterator, you would need to be able to reset to a location, but the
 context beforehand could affect what happens.


 Mark

 *— Il meglio è l’inimico del bene —*



  On Fri, Mar 25, 2011 at 14:22, Mike Samuel mikesam...@gmail.com wrote:

 2011/3/25 Mike Samuel mikesam...@gmail.com:

  2011/3/25 Nebojša Ćirić c...@google.com:
  find method wouldn't return boolean but an array of two values:
 
  Sorry if I wasn't clear.  The !! at the beginning of the call to find
  is important.
  The undefined value you mentioned below as possible no match result is
  falsey because !!undefined === false.
 
  myCollator.find('gaard', 'ard', 2) -> [2, 5]  // 4 or 5 as a bound
  myCollator.find('ard', 'ard', 0) -> [0, 3]  // 2 or 3 as a bound
  I guess [2, 5] !== [0, 3]
 
  True, but also [2, 5] !== [2, 5].
 
  We could return [-1, undefined] for not found state, or just undefined.
 
  I agree that returning a boolean makes for easier tests in loops.
 
 
  On 25 March 2011 at 14:00, Mike Samuel mikesam...@gmail.com wrote:
 
  2011/3/25 Nebojša Ćirić c...@google.com:
   Looking through the notes from the meeting I also found some problems
   with
   the collator. We did specify the collatorType: search, but we didn't
   offer a
   function that would make use of it. Mark and I are thinking about:
   /**
    * string - string to search over.
    * substring - string to look for in string
    * index - start search from index
    * @return {Array} [first, last] - first is index of the match or -1,
    *   last is end of the match or undefined.
    */
   LocaleInfo.Collator.prototype.find(string, substring, index)
   We could also opt for iterator solution where we keep the state.
 
  Assuming find returns a falsey value when nothing is found, is it the
  case that for all (string, index) pairs,
 
  !!myCollator.find(string, substring, index) ===
  !!myCollator.find(string.substring(index), substring, 0)

 Maybe a better way to phrase this relation is

 will any collator ever look at a code-unit to the left of index when
 trying to determine whether there is a match at or after index?

 E.g. if the code-unit at index might be a strict suffix of a substring
 that could be represented as a one codepoint ligature.



  This would be false if the substring 'ard' should be found in 'gard',
  but not 'gaard' because then
 
  !!myCollator.find('gaard', 'ard', 2) !== !!myCollator.find('ard',
  'ard', 0)
 
 
  If that relation does not hold, then exposing find as an iterator
  might help prevent a profusion of subtly wrong loops.
 
 
   The reason we need to return both begin and end part of the found
 string
   is:
   Look for gaard and we find gård - which may be equivalent in Danish,
 but
   substring lengths don't match (5 vs. 4) so we need to tell user the
 next
   index position.
   The other problem Jungshik found is that there is a combinatorial

Re: Collation API not complete for search

2011-03-25 Thread Mark Davis
I think an iterator is a cleaner interface; we were just trying to minimize
new API.

In general, collation is context sensitive, so searching on substrings isn't
a good idea. You want to search from a location, but have the rest of the
text available to you.

For the iterator, you would need to be able to reset to a location, but the
context beforehand could affect what happens.

Mark

*— Il meglio è l’inimico del bene —*


On Fri, Mar 25, 2011 at 14:22, Mike Samuel mikesam...@gmail.com wrote:

 2011/3/25 Mike Samuel mikesam...@gmail.com:
  2011/3/25 Nebojša Ćirić c...@google.com:
  find method wouldn't return boolean but an array of two values:
 
  Sorry if I wasn't clear.  The !! at the beginning of the call to find
  is important.
  The undefined value you mentioned below as possible no match result is
  falsey because !!undefined === false.
 
   myCollator.find('gaard', 'ard', 2) -> [2, 5]  // 4 or 5 as a bound
   myCollator.find('ard', 'ard', 0) -> [0, 3]  // 2 or 3 as a bound
  I guess [2, 5] !== [0, 3]
 
  True, but also [2, 5] !== [2, 5].
 
  We could return [-1, undefined] for not found state, or just undefined.
 
  I agree that returning a boolean makes for easier tests in loops.
 
 
   On 25 March 2011 at 14:00, Mike Samuel mikesam...@gmail.com wrote:
 
  2011/3/25 Nebojša Ćirić c...@google.com:
   Looking through the notes from the meeting I also found some problems
   with
   the collator. We did specify the collatorType: search, but we didn't
   offer a
   function that would make use of it. Mark and I are thinking about:
    /**
     * string - string to search over.
     * substring - string to look for in string
     * index - start search from index
     * @return {Array} [first, last] - first is index of the match or -1,
     *   last is end of the match or undefined.
     */
   LocaleInfo.Collator.prototype.find(string, substring, index)
   We could also opt for iterator solution where we keep the state.
 
  Assuming find returns a falsey value when nothing is found, is it the
  case that for all (string, index) pairs,
 
  !!myCollator.find(string, substring, index) ===
  !!myCollator.find(string.substring(index), substring, 0)

 Maybe a better way to phrase this relation is

 will any collator ever look at a code-unit to the left of index when
 trying to determine whether there is a match at or after index?

 E.g. if the code-unit at index might be a strict suffix of a substring
 that could be represented as a one codepoint ligature.


  This would be false if the substring 'ard' should be found in 'gard',
  but not 'gaard' because then
 
  !!myCollator.find('gaard', 'ard', 2) !== !!myCollator.find('ard',
  'ard', 0)
 
 
  If that relation does not hold, then exposing find as an iterator
  might help prevent a profusion of subtly wrong loops.
 
 
   The reason we need to return both begin and end part of the found
 string
   is:
   Look for gaard and we find gård - which may be equivalent in Danish,
 but
   substring lengths don't match (5 vs. 4) so we need to tell user the
 next
   index position.
   The other problem Jungshik found is that there is a combinatorial
   explosion
   with all ignoreXXX options we defined. My proposal is to define only
 N
   that
   make sense (and can be supported by all implementors) and fall back
 the
   rest
   to some predefined default.
 
 
 
   --
   Nebojša Ćirić
  
   ___
   es-discuss mailing list
   es-discuss@mozilla.org
   https://mail.mozilla.org/listinfo/es-discuss
  
  
 
 
 
  --
  Nebojša Ćirić
 
 
 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Stupid i18n use cases question

2011-01-29 Thread Mark Davis
There are really 5 cases at issue:

   1. Code point breaks
   2. Grapheme-Cluster breaks (with three possible variants: 'legacy',
   extended, and aksara http://www.unicode.org/glossary/#aksara)
   3. Word breaks
   4. Line breaks
   5. Sentence breaks

Notes:

   - #1 is pretty trivial to do right in ES (see the sketch after these notes).
   - The others can be done in ES, but the code is more complicated -- the
   biggest issue is that they require a download of a possibly substantial
   amount of data. For certain languages, #3 requires considerable code and
   data.
   - Word-breaks are different than linebreaks; the latter are the points
   where you can wrap a line, which may include more than a word or come in the
   middle of a word.
   - For examples, see http://unicode.org/cldr/utility/breaks.jsp.
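
As an illustration of why #1 is trivial in ES while the others need data, a
sketch (helper name illustrative only) that returns the code point break
positions of a string, with offsets in 16-bit code units:

function codePointBreaks(s) {
  var breaks = [0];
  for (var i = 0; i < s.length; i++) {
    var u = s.charCodeAt(i);
    // A high surrogate followed by a low surrogate is one code point.
    if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.length &&
        s.charCodeAt(i + 1) >= 0xDC00 && s.charCodeAt(i + 1) <= 0xDFFF) {
      i++;
    }
    breaks.push(i + 1);
  }
  return breaks;
}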


I don't know about the specific use cases that Jungshik had in mind, but if
you are doing client-side word-processing in ES (which various software
does, including ours), then you want all of these, except perhaps #5. For
example, a double-click uses #3.

There are other use cases for #4 besides word processing; for example, to break
up long SMS's, we break at line boundaries. I'm not saying that someone has
to do this in ES; just giving an example outside of the word-processing
domain.

Mark

*— Il meglio è l’inimico del bene —*


On Sat, Jan 29, 2011 at 10:25, Shawn Steele shawn.ste...@microsoft.comwrote:

   On the phone yesterday we mentioned word/line breaking and grapheme
 clusters.   It didn't occur to me to ask about the use cases.



 Why does someone need word/line breaking in js?  It seems like that would
 better be done by my rendering engine, like the HTML layout engine or my
 edit control or something?



 -Shawn



  

 http://blogs.msdn.com/shawnste



 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: i18n objects

2011-01-24 Thread Mark Davis
As stated before, I think that this approach is more error prone; that it
would be better to explicitly call the other function. Here would be the
difference between the two alternatives for the API: A and B, under the two
common scenarios:

*Scenario 1 I don't care*

A.
x = myLocaleInfo.region;

B.
x = myLocaleInfo.inferRegion();

*Scenario 2. I only want explicit region*

A.
x = myLocaleInfo.hasInferredRegion ? undefined : myLocaleInfo.region;

B.
x = myLocaleInfo.region();

I find the B approach simpler and clearer, and we don't have to have an
extra input parameter.


Mark

*— Il meglio è l’inimico del bene —*


On Mon, Jan 24, 2011 at 10:25, Shawn Steele shawn.ste...@microsoft.comwrote:

  Considering last week’s discussion on the i18n objects, I think I’ll
 follow this pattern:



 · Constructor takes options, as specified

 · LocaleInfo takes an option to enable inferring.

 o   Default to infer or not is an open question.

 · Have an isInferred() function to test if a property was
 inferred.

 · NO options property

 · Instead individual properties for each value.

 · Using the .derive method to derive a similar object.



 Discussion of each of these should probably have individual threads unless
 they directly impact each other; last week’s thread wandered between topics
 without really resolving them.



 My reasoning:

 · I didn’t use the options property because an options property is
 controversial, and leads to other “hard” questions, like:

 o   Would options represent only the state when constructed?  Or the
 current state?  (Can they differ?)

 o   Would options be read-only?  (And then how would you use it).

 o   Would options be a writable copy (which sounds expensive to me)?

 o   Would options be mutable?

 · It’s clear that we want to be able to infer or not.  I find the
 ability to set it in the constructor much simpler.  A disadvantage is that a
 library would have to figure out if inputs were inferred by using
 isInferred().  An advantage is that when a worker doesn’t really care if
 data is inferred or not, then the caller can pass a correctly inferred (or
 not) object to the worker.

 · If there isn’t an options property, then there are fewer
 mechanisms to create a similar derived object.  The suggested .derive()
 function seemed simplest.



 -Shawn





 - Shawn



  

 http://blogs.msdn.com/shawnste

 (Selfhost 7908)



 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: 2nd day meeting comments on the latest i18n API proposal

2011-01-21 Thread Mark Davis
I would actually rather not have it be a construction argument, because it
is easier for people to make mistakes that way.

When I look this over, there are relatively few fields that need this. So
what about having API like:

// get an explicitly-set region, or null if there was no region parameter in
the constructor.

region1 = myLocaleInfo.region;

// gets myLocaleInfo.region if not null
// otherwise infers the region from other information in the LocaleInfo.

aRegion = myLocaleInfo.inferRegion();




The ones where this would be done would be:

inferBaseLanguage()
inferScript()
inferRegion()
inferCurrency()
// later: inferTimeZone()


Mark

*— Il meglio è l’inimico del bene —*


On Fri, Jan 21, 2011 at 11:19, Axel Hecht a...@mozilla.com wrote:

 I for one am very much against setting inferValues to true by default.
 That's just obfuscating i18n bugs, and I think we should design our APIs so
 that it's easy to get it right. Shooting yourself in the foot should be
 hard, if at all possible.

 Axel


 On 21.01.11 09:39, Nebojša Ćirić wrote:

 If we were to go with loc.options, I would define it as having only
 items that were explicitly listed when object was constructed. So:

 var loc = new LocaleInfo({'calendar':'hijri', 'lang': 'en-US',
 'inferValues': true});

 loc.options would contain:
 loc.options.calendar - hijri
 loc.options.lang - en-US
 loc.options.inferValues - true

 loc object would have:
 loc.currency
 loc.lang
 loc.inferValues
 loc.calendar (maybe)
 loc.region
 ...

 One could implement isInferred(x) method like this:
 function isInferred(x) {
   if (x in loc.options && x in loc && loc.options[x] === loc[x])
     return false;
   else
     return true;
 }

 There would be some duplication of data with this approach, but it would
 give us easy way to check what were the options we passed in when
 constructed the object, and to detect if the value was inferred or not.

 Btw. I think we agreed to have inferreValues set to true by default.

 On 20 January 2011 at 22:15, Peter Constable peter...@microsoft.com
 mailto:peter...@microsoft.com wrote:


We had talked about having an option on the LocaleInfo constructor
to control whether values not explicitly set could be inferred. E.g.

var loc = new LocaleInfo({lang:”en-US”, inferValues:true});

var curr = loc.currency; // is USD

With the approach of having .derive() methods, then the constructed
clone is exactly the same. So, for instance,

var loc2 = loc.derive({calendar:hijri});

if (loc2.isInferred(currency)) // is true

Considering this other approach, when I get opt2…

var opt2 = loc.options;

what all does opt2 include? Does it include only options that were
explicitly passed in when loc was constructed, or does it include
specific values for all LocaleInfo properties that could be set as
options and could be inferred? If the latter, does it reflect that
most of them were inferred?

Peter

*From:*Nebojša Ćirić [mailto:c...@google.com mailto:c...@google.com]

*Sent:* Wednesday, January 19, 2011 5:02 PM
*To:* es-discuss@mozilla.org mailto:es-discuss@mozilla.org

*Subject:* 2nd day meeting comments on the latest i18n API proposal

Eric proposed to remove the derive method from all API objects and
do something like this:

var loc = new LocaleInfo({});  // {...} are the options we
construct LocaleInfo object with.

var opt2 = loc.options;  // This returns a copy of options from loc
object.

opt2.currency = USD;

var loc2 = new LocaleInfo(opt2);

This approach yields the same result with more code, but it's more
in sync with how people expect a JavaScript API to work.

In this case LocaleInfo needs to have options property that holds
original inputs. This approach could also help with inferred values
- one can compare options with actual property and see if they are
the same.

var loc = LocaleInfo({currency: 'USD'});

if (loc.options.currency == loc.currency) ...

--
Nebojša Ćirić




 --
 Nebojša Ćirić



 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

 ___
 es-discuss mailing list
 es-discuss@mozilla.org
 https://mail.mozilla.org/listinfo/es-discuss

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: 2nd day meeting comments on the latest i18n API proposal

2011-01-21 Thread Mark Davis
The problem I see is that if I hand you a LocaleInfo, and there is only one
API to get the region, then it (in your words) **is easy** to accidentally
make the wrong choice, or not realize they need to make a choice.

   - x.region may be an explicit value or may be computed: I have to call
   some other API to find out.


If we have a separate API, then I, the developer, have it clearly in front
of me what choice I am to make, and it **is harder** to accidentally make
the wrong choice, etc.

   - x.region is always an explicit value
   - x.inferRegion() is computed if there is no explicit value.


Mark

*— Il meglio è l’inimico del bene —*


On Fri, Jan 21, 2011 at 14:04, Shawn Steele shawn.ste...@microsoft.comwrote:

  IMO that’s going overboard in the other direction J  It’d be nice to find
 some middle ground.



 Sometimes inferring can be very bad.  Sometimes it can be very good.  The
 problem isn’t that one is “right” or “wrong” for all apps, but rather that
 it might be simple for developers to accidentally make the wrong choice, or
 not realize they need to make a choice.  If I need to infer, it should be
 easy to get a fully-inferred LocaleInfo.  If I don’t need to infer, then it
 should be easy to avoid inferring for LocaleInfo.



 -Shawn



 *From:* es-discuss-boun...@mozilla.org [mailto:
 es-discuss-boun...@mozilla.org] *On Behalf Of *Mark Davis ☕
 *Sent:* Friday, January 21, 2011 12:44 PM
 *To:* Axel Hecht
 *Cc:* Derek Murman; es-discuss@mozilla.org
 *Subject:* Re: 2nd day meeting comments on the latest i18n API proposal



 I would actually rather not have it be a construction argument, because it
 is easier for people to make mistakes that way.



 When I look this over, there are relatively few fields that need this. So
 what about having API like:



  // get an explicitly-set region, or null if there was no region parameter
 in the constructor.



 region1 = myLocaleInfo.region;



 // gets myLocaleInfo.region if not null

 // otherwise infers the region from other information in the LocaleInfo.



 aRegion = myLocaleInfo.inferRegion();







 The ones where this would be done would be:



 inferBaseLanguage()

 inferScript()

 inferRegion()

 inferCurrency()

 // later: inferTimeZone()





 Mark

 *— Il meglio è l’inimico del bene —*

  On Fri, Jan 21, 2011 at 11:19, Axel Hecht a...@mozilla.com wrote:

 I for one am very much against setting inferValues to true by default.
 That's just obfuscating i18n bugs, and I think we should design our APIs so
 that it's easy to get it right. Shooting yourself in the foot should be
 hard, if at all possible.

 Axel



 On 21.01.11 09:39, Nebojša Ćirić wrote:

  If we were to go with loc.options, I would define it as having only
 items that were explicitly listed when the object was constructed. So:

 var loc = new LocaleInfo({'calendar':'hijri', 'lang': 'en-US',
 'inferValues': true});

 loc.options would contain:
 loc.options.calendar - hijri
 loc.options.lang - en-US
 loc.options.inferValues - true

 loc object would have:
 loc.currency
 loc.lang
 loc.inferValues
 loc.calendar (maybe)
 loc.region
 ...

 One could implement an isInferred(x) method like this:
 function isInferred(x) {
   // a value counts as explicit only if it was passed in and still matches
   if (x in loc.options && x in loc && loc.options[x] === loc[x])
     return false;
   else
     return true;
 }

 There would be some duplication of data with this approach, but it would
 give us an easy way to check which options we passed in when we
 constructed the object, and to detect whether a value was inferred or not.
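
 For example, with the loc constructed above (a sketch, under the
 assumption that region and currency were not passed in and are therefore
 filled in by inference):

 isInferred('calendar'); // false -- passed explicitly and unchanged
 isInferred('lang');     // false -- passed explicitly and unchanged
 isInferred('region');   // true  -- never passed, so it had to be inferred
 isInferred('currency'); // true  -- likewise inferred (e.g. from the region)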

 Btw. I think we agreed to have inferValues set to true by default.

 On 20 January 2011 at 22:15, Peter Constable peter...@microsoft.com wrote:



We had talked about having an option on the LocaleInfo constructor
to control whether values not explicitly set could be inferred. E.g.

var loc = new LocaleInfo({lang: 'en-US', inferValues: true});

var curr = loc.currency; // is USD

With the approach of having .derive() methods, then the constructed
clone is exactly the same. So, for instance,

var loc2 = loc.derive({calendar: 'hijri'});

if (loc2.isInferred('currency')) // is true

Considering this other approach, when I get opt2…

var opt2 = loc.options;

what all does opt2 include? Does it include only options that were
explicitly passed in when loc was constructed, or does it include
specific values for all LocaleInfo properties that could be set as
options and could be inferred? If the latter, does it reflect that
most of them were inferred?

Peter

*From:* Nebojša Ćirić [mailto:c...@google.com]


*Sent:* Wednesday, January 19, 2011 5:02 PM

*To:* es-discuss@mozilla.org


*Subject:* 2nd day meeting comments on the latest i18n API proposal

Eric proposed to remove the derive method from all API objects and
do something like this:

var loc = new LocaleInfo

Re: i18n collator options

2011-01-20 Thread Mark Davis
We could do either.

Mark

*— Il meglio è l’inimico del bene —*


On Thu, Jan 20, 2011 at 16:14, Shawn Steele shawn.ste...@microsoft.com wrote:

  For UTF-16 order do you use, say, the Turkish casing if it was a Turkish
 locale?



 -Shawn



 *From:* mark.edward.da...@gmail.com [mailto:mark.edward.da...@gmail.com] *On
 Behalf Of *Mark Davis ☕️
 *Sent:* Thursday, January 20, 2011 4:05 hours

 *To:* Shawn Steele
 *Cc:* es-discuss@mozilla.org; Peter Constable; Derek Murman
 *Subject:* Re: i18n collator options



 (BTW I haven't gotten added to es-discuss yet, so one of you might forward
 these 3 messages there.)




 Mark

 *— Il meglio è l’inimico del bene —*

  On Thu, Jan 20, 2011 at 14:48, Shawn Steele shawn.ste...@microsoft.com
 wrote:

 For case-insensitive UTF-16 order, how do you get the casing mappings?



 We use the Unicode mappings.



  Normalization should maybe be deferred to after v0.5.  It’s not a direct
 option for me (I could normalize first though), so it’d require thinking.



 This is not a major case, so I agree on deferring.





 I’d prefer describing options that describe the behavior you’ll get, as
 opposed to the strength, which kind of bundles stuff together.  I guess that
 runs into the IgnoreDiacritics/IgnoreWidth issue though.



 Yes, we can't support all of the options completely orthogonally. In
 practice, we've never seen a need to distinguish the widths. So I'd suggest
 the intersection of the two:



 · Ordinal – code point based non-linguistic comparison; mutually
 exclusive with any other option.

 · IgnoreCase – Ignore case - on/off/default

 · IgnoreDiacritics – Ignore diacritics/nonspacing characters -
 on/off/default

 · SortDigitsAsNumbers – Eg: 12 comes before 101 - on/off/default



 The following I don't think is a high priority; that is, the default for
 the language should be fine.

 · IgnoreKanaType – Treat Hiragana and Katakana the same -
 on/off/default
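
 To illustrate how such flags might surface (the constructor name, option
 spelling, and compare method below are assumptions for the sake of the
 example, not part of any agreed API):

 // Hypothetical usage only.
 var c = new Collator('en-US', {
   ignoreCase: true,          // 'a' and 'A' compare equal
   ignoreDiacritics: false,   // 'e' and 'é' stay distinct
   sortDigitsAsNumbers: true  // '2' sorts before '12', before '101'
 });

 ['file101', 'file12', 'File2'].sort(c.compare);
 // -> ['File2', 'file12', 'file101'] with the options above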











 -Shawn





 *From:* mark.edward.da...@gmail.com [mailto:mark.edward.da...@gmail.com] *On
 Behalf Of *Mark Davis ☕️
 *Sent:* Thursday, January 20, 2011 12:42 hours
 *To:* Shawn Steele
 *Cc:* es-discuss@mozilla.org; Peter Constable; Derek Murman
 *Subject:* Re: i18n collator options



 On #1 (delaying on others):



 In ICU, the following are very easy:

 1. *code point order and/or UTF-16 order. *Options:
    1. case-sensitivity: off, on
    2. normalization: none, nfc, nfd, nfkc, nfkd

 2. *language-sensitive. *Options:
    1. Strength: default, primary (ignore accents, case, compat variants),
       secondary (ignore case, variants), tertiary (ignore minor variants),
       identical
    2. Numeric: default, off, on (eg, xyz12 > xyz2)
    3. Case: default, force-upper-first, force-lower-first
    4. Punctuation: default, ignore, don't ignore
    5. Case level: default, on, off (to get ignore accents but not
       case, use strength:primary + case-level:on)
    6. Hiragana level: default, on, off (with off, hiragana not
       distinguished from katakana)
    7. (there are other options, but they are less important)

  Under language-sensitive, the *default* for each option may vary
 according to the language.
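
 As an illustration of combining these knobs (the names below simply mirror
 the list above and are not a concrete API), the "ignore accents but not
 case" combination from item 5 would look roughly like:

 var opts = {
   strength: 'primary',    // fold away accents, case, compat variants...
   caseLevel: 'on',        // ...then re-introduce case as its own level
   numeric: 'on',          // so xyz2 sorts before xyz12
   punctuation: 'ignore'   // item 4 above
 };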



 You can try out the functionality at http://goo.gl/GQuI



 Compared to Windows (if I read your options correctly), the only real issue
 is that some of the options are not orthogonal: in particular, you can't have
 the equivalent of IgnoreDiacritics=true and IgnoreWidth=false. So if
 someone were to ask for that combination, the best we could supply would be 
 IgnoreDiacritics=true
 and IgnoreWidth=true.





 Mark

 *— Il meglio è l’inimico del bene —*

 On Thu, Jan 20, 2011 at 10:56, Shawn Steele shawn.ste...@microsoft.com
 wrote:

 The i18n group said we’d figure out collator options by email.  This is an
 email :)



 The strawman used the “strength” term for collation options; however, there
 seemed to be a general feeling that descriptive flags would be more useful.
 So here’s an attempt at some flags.  These I can do fairly easily on
 Windows, but I don’t know how they’d work in ICU.



 · Ordinal – (code point based non-linguistic comparison.  This
 sort of defeats the purpose of passing in a locale, however it is a very
 common scenario for some people.  Eg: I don’t want to compare passwords in a
 linguistic fashion.)  This should basically be mutually exclusive with any
 other option.

 · IgnoreCase – Ignore case for case-sensitive scripts.  Hopefully
 don’t ignore anything else, but some frameworks may have trouble with that.

 · IgnoreDiacritics – Ignore diacritics/nonspacing characters.

 · IgnoreKanaType – Treat Hiragana and Katakana the same.

 · IgnoreWidth – Treat CJK full and halfwidth characters the same.

 · SortDigitsAsNumbers – Eg: 12 comes before 101.



 Are these all “doable” for other frameworks? (ICU?)



 Assuming that we go

Re: EcmaScript i18n API proposal

2010-06-10 Thread Mark Davis
*Re the following message:*
It is clearly expected that the number of locales available on any
particular device may be limited; a smartphone, for example, might have very
few installed, or have limited services for those it does have installed. With
the locale model, implementations are expected to use the best match. That
is, for a given service (like collation) if there is no support for German
phonebook, then it would fallback to German; if there is no support for
German then it will fall back to some default locale, such as English. The
features of a locale are best thought of as requests, to be used wherever
possible.
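
A rough sketch of that fallback chain, assuming a simple truncation-based
lookup ('de-DE-u-co-phonebk' is just one possible spelling of the
German-phonebook request; real matching follows BCP 47 rules and handles
extension subtags more carefully):

// Illustrative best-match lookup: drop subtags from the right until a
// supported locale is found, otherwise use an implementation default.
function bestMatch(requested, available) {
  var tag = requested;
  while (tag) {
    if (available.indexOf(tag) !== -1) return tag;
    var cut = tag.lastIndexOf('-');
    tag = cut > 0 ? tag.slice(0, cut) : '';
  }
  return 'en'; // whatever default the implementation ships with
}

bestMatch('de-DE-u-co-phonebk', ['de-DE-u-co-phonebk', 'de']); // phonebook supported
bestMatch('de-DE-u-co-phonebk', ['de', 'en']);                 // falls back to 'de'
bestMatch('de-DE-u-co-phonebk', ['en']);                       // falls back to 'en'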

That being said, you'd be surprised at how many implementations could easily
support the example German phonebook through underlying OS or library
services, since it has been a standard option on Windows for quite a
while (locale
0x10407 = German with phonebook sort), and in ICU -- thus in the Mac OS, on
Google servers, etc.

Mark
=====
*Erik Corry* erik.corry at gmail.com
*Wed Jun 9 11:53:14 PDT 2010*



--

On the face of it this proposal introduces a huge new area of
incompatibilities between engines in terms of both which locales are
supported and the details of the locales.  The example (German
phonebook locale) is suitably obscure as to illustrate the
hopelessness of expecting JS engines to contain all thinkable locales.
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss