RE: [IndexedDB] Spec changes for international language support
From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On Behalf Of Keean Schupke Sent: Friday, March 18, 2011 8:17 PM On 18 March 2011 19:29, Pablo Castro pablo.cas...@microsoft.com wrote: From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On Behalf Of Keean Schupke Sent: Friday, March 18, 2011 1:53 AM See my proposal in another thread. The basic idea is to copy BDB. Have a primary index that is based on an integer, something primitive and fast. Allow secondary indexes which use a callback to generate a binary index key. IDB shifts the complexity out into a library. Common use cases can be provided (a hash of all fields in the object, internationalised bidirectional lexicographic etc...), but the user is free to write their own for less usual cases (for example indexing by the last word in a name string to order by surname). I agree with Jeremy's comments on the other thread for this. Having the callback mechanism definitely sounds interesting but there are a ton of common cases that we can solve by just taking a language identifier, I'm not sure we want to make people work hard to get something that's already supported in most systems. The idea of having a callback to compute the index value feels incremental to this, so we could take on it later on without disrupting the explicit international collation stuff. The idea would be to provide pre-defined implementations of the callback for common use cases, then it is just as simple to register a callback as set any other option. All this means to the API is you pass a function instead of a string. It also is better for modularity as all the code relating to the sort order is kept in the callback functions. 
The difference comes down to something like: index.set_order_lexicographic('us'); vs index.set_order_method(order_lexicographic('us')); So more than just setting a property like the first case, where presumably all the ordering code is mixed in with the indexing code, the second case encapsulates all the ordering code in the function returned from the execution of order_lexicographic('us'). This function would represent a mapping from the object being indexed to a binary blob that is the actual stored index data. So doing it this way does not necessarily make things harder, and it improves the encapsulation, type safety, and flexibility of the API.

Yep, we talked about supporting callbacks already in the other threads and in this one. As I mentioned before, I think this is incremental to the basic feature of taking a collation name. I do realize you can just pass a pre-implemented function, but that opens the door to a bunch of things we'd need to handle, including possibly storing code in the database (such that proper updates don't depend on each page re-registering all the index callbacks), handling scripts with the appropriate context to run during index updates, etc. I would much rather have basic functionality in place and then expand as needed once we have users using the API.

Thanks
-pablo
Re: [IndexedDB] Spec changes for international language support
On Tue, Mar 22, 2011 at 6:13 PM, Pablo Castro pablo.cas...@microsoft.com wrote: From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On Behalf Of Keean Schupke Sent: Tuesday, March 22, 2011 5:34 PM IMHO not the job of Idb to store the callbacks, so I don't see this complexity as a reason not to implement the API using callbacks. I think having one consistent API is more important. Specifying the collation 'name' has all the same problems as callbacks (needs to be re-done on every page, possibility of using different collations on different pages). Really a 'function' is just a symbol for a collation. A function name, is a better symbol for a collation than a string. Function's have a uniqueness property strings do not. So specifying a function as the collations instead of a string really is the same thing. Consider below: I don't think it's the same. If we don't store the callbacks in the database it means every page has to have full knowledge of the database schema (at least all the indexes) all the time, instead of just pulling that in on demand when needed. It also means we can never allow browser developer tools or generic dev-tool-webpages to modify the database because indexes would become invalid (not sure allowing tools to mess with the database in general is a good idea, but I thought it illustrated the point well). I wonder if the overall issue we're discussing has to do with how embedded the database is. In BDB scenarios where the database is completely invisible outside of an application many of these decisions make more sense. I don't think of web applications that way. I think of them more as a number of building blocks (pages, pieces within pages, tool pages added on the side) that are authored and sometimes even versioned independently, and the interface between those building blocks and the store is public and visible to tools and generic data browsers. All that changes the assumptions in the overall picture. Yup. 
I agree with Pablo here. / Jonas
Re: [IndexedDB] Spec changes for international language support
See my proposal in another thread. The basic idea is to copy BDB. Have a primary index that is based on an integer, something primitive and fast. Allow secondary indexes which use a callback to generate a binary index key. IDB shifts the complexity out into a library. Common use cases can be provided (a hash of all fields in the object, internationalised bidirectional lexicographic etc...), but the user is free to write their own for less usual cases (for example indexing by the last word in a name string to order by surname). Cheers, Keean. On 18 March 2011 02:19, Jonas Sicking jo...@sicking.cc wrote: 2011/3/17 Pablo Castro pablo.cas...@microsoft.com: From: Jonas Sicking [mailto:jo...@sicking.cc] Sent: Tuesday, March 08, 2011 1:11 PM All in all, is there anything preventing adding the API Pablo suggests in this thread to the IndexedDB spec drafts? I wanted to propose a couple of specific tweaks to the initial proposal and then unless I hear pushback start editing this into the spec. From reading the details on this thread I'm starting to realize that per-database collations won't do it. What did it for me was the example that has a fuzzier matching mode (case/accent insensitive). This is exactly the kind of index I would want to sort people's names in my address book, but most likely not the index I'll want to use for my primary key. Refactoring the API to accommodate for this would mean to move the setCollation() method and the collation property to the object store and index objects. If we were willing to live without the ability to change them we could take collation as one of the optional parameters to createObjectStore()/createIndex() and reduce a bit of surface area... Unfortunately I think you bring up good use cases for per-objectStore/index collations. It's definitely tempting to just add it as a optional parameter to createObjectStore/createIndex. The downside is obviously pushing more complexity onto web developers. 
Complexity which will be duplicated across sites. However there is another problem to consider here. Can switching collation on a objectStore or a unique index can affect its validity? I.e. if you switch from a case sensitive to a case insensitive collation, does that mean that if you have two entries with the primary keys Sweden and sweden they collide and thus the change of collation must result in an error (or aborted transaction)? I do seem to recall that there are ways to do at least case sensitivity such that you generally don't take case into account when sorting, unless two entries are exactly the same, in which case you do look at casing to differentiate them. However I don't really know a whole lot about this and so defer to people that know internationalization better. I don't have a strong preference there. In any case both would use BCP47 names as discussed in this thread (as Jonas pointed out, implementations can also do their thing as long as they don't interfere with BCP47). Another piece of feedback I heard consistently as I discussed this with various folks at Microsoft is the need to be able to pick up what the UA would consider the collation that's most appropriate for the user environment (derived from settings, page language or whatever). We could support this by introducing a special value that you can pass to setCollation that indicates pick whatever is the right for the environment's language right now. Given that there is no other way for people to discover the user preference on this, I think this is pretty important. I would be fine with this as long as it's a explicit opt-in. There is definitely a risk that people will do this and then only do testing in one language, but it seems to me like a useful use case to support, and I don't see a way of supporting this while completely avoiding the risk of internationalization bugs. / Jonas
RE: [IndexedDB] Spec changes for international language support
From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On Behalf Of Keean Schupke Sent: Friday, March 18, 2011 1:53 AM See my proposal in another thread. The basic idea is to copy BDB. Have a primary index that is based on an integer, something primitive and fast. Allow secondary indexes which use a callback to generate a binary index key. IDB shifts the complexity out into a library. Common use cases can be provided (a hash of all fields in the object, internationalised bidirectional lexicographic etc...), but the user is free to write their own for less usual cases (for example indexing by the last word in a name string to order by surname). I agree with Jeremy's comments on the other thread for this. Having the callback mechanism definitely sounds interesting but there are a ton of common cases that we can solve by just taking a language identifier, I'm not sure we want to make people work hard to get something that's already supported in most systems. The idea of having a callback to compute the index value feels incremental to this, so we could take on it later on without disrupting the explicit international collation stuff. On 18 March 2011 02:19, Jonas Sicking jo...@sicking.cc wrote: 2011/3/17 Pablo Castro pablo.cas...@microsoft.com: From: Jonas Sicking [mailto:jo...@sicking.cc] Sent: Tuesday, March 08, 2011 1:11 PM All in all, is there anything preventing adding the API Pablo suggests in this thread to the IndexedDB spec drafts? I wanted to propose a couple of specific tweaks to the initial proposal and then unless I hear pushback start editing this into the spec. From reading the details on this thread I'm starting to realize that per-database collations won't do it. What did it for me was the example that has a fuzzier matching mode (case/accent insensitive). This is exactly the kind of index I would want to sort people's names in my address book, but most likely not the index I'll want to use for my primary key. 
Refactoring the API to accommodate for this would mean to move the setCollation() method and the collation property to the object store and index objects. If we were willing to live without the ability to change them we could take collation as one of the optional parameters to createObjectStore()/createIndex() and reduce a bit of surface area... Unfortunately I think you bring up good use cases for per-objectStore/index collations. It's definitely tempting to just add it as a optional parameter to createObjectStore/createIndex. The downside is obviously pushing more complexity onto web developers. Complexity which will be duplicated across sites. However there is another problem to consider here. Can switching collation on a objectStore or a unique index can affect its validity? I.e. if you switch from a case sensitive to a case insensitive collation, does that mean that if you have two entries with the primary keys Sweden and sweden they collide and thus the change of collation must result in an error (or aborted transaction)? I do seem to recall that there are ways to do at least case sensitivity such that you generally don't take case into account when sorting, unless two entries are exactly the same, in which case you do look at casing to differentiate them. However I don't really know a whole lot about this and so defer to people that know internationalization better. This is a good point. It makes me lean toward not allowing changing the collation of an index or store. That means we could just have an optional parameter (in the generic parameter object thingy we have now) on createObjectStore and createIndex that indicates the collation name. It seems minimally disruptive, it doesn't tax people that don't care about it, and since there is no setCollation we don't have the problem of not being able to re-index the data. 
Another piece of feedback I heard consistently as I discussed this with various folks at Microsoft is the need to be able to pick up what the UA would consider the collation that's most appropriate for the user environment (derived from settings, page language or whatever). We could support this by introducing a special value that you can pass to setCollation that indicates pick whatever is the right for the environment's language right now. Given that there is no other way for people to discover the user preference on this, I think this is pretty important. I would be fine with this as long as it's a explicit opt-in. There is definitely a risk that people will do this and then only do testing in one language, but it seems to me like a useful use case to support, and I don't see a way of supporting this while completely avoiding the risk of internationalization bugs. I agree, it should be opt-in. I still assume we'll default to binary collation (same if you specify the collation value as null). I was
Re: [IndexedDB] Spec changes for international language support
On Fri, Mar 18, 2011 at 12:29 PM, Pablo Castro pablo.cas...@microsoft.com wrote: From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On Behalf Of Keean Schupke Sent: Friday, March 18, 2011 1:53 AM See my proposal in another thread. The basic idea is to copy BDB. Have a primary index that is based on an integer, something primitive and fast. Allow secondary indexes which use a callback to generate a binary index key. IDB shifts the complexity out into a library. Common use cases can be provided (a hash of all fields in the object, internationalised bidirectional lexicographic etc...), but the user is free to write their own for less usual cases (for example indexing by the last word in a name string to order by surname). I agree with Jeremy's comments on the other thread for this. Having the callback mechanism definitely sounds interesting but there are a ton of common cases that we can solve by just taking a language identifier, I'm not sure we want to make people work hard to get something that's already supported in most systems. The idea of having a callback to compute the index value feels incremental to this, so we could take on it later on without disrupting the explicit international collation stuff. On 18 March 2011 02:19, Jonas Sicking jo...@sicking.cc wrote: 2011/3/17 Pablo Castro pablo.cas...@microsoft.com: From: Jonas Sicking [mailto:jo...@sicking.cc] Sent: Tuesday, March 08, 2011 1:11 PM All in all, is there anything preventing adding the API Pablo suggests in this thread to the IndexedDB spec drafts? I wanted to propose a couple of specific tweaks to the initial proposal and then unless I hear pushback start editing this into the spec. From reading the details on this thread I'm starting to realize that per-database collations won't do it. What did it for me was the example that has a fuzzier matching mode (case/accent insensitive). 
This is exactly the kind of index I would want to sort people's names in my address book, but most likely not the index I'll want to use for my primary key. Refactoring the API to accommodate for this would mean to move the setCollation() method and the collation property to the object store and index objects. If we were willing to live without the ability to change them we could take collation as one of the optional parameters to createObjectStore()/createIndex() and reduce a bit of surface area... Unfortunately I think you bring up good use cases for per-objectStore/index collations. It's definitely tempting to just add it as a optional parameter to createObjectStore/createIndex. The downside is obviously pushing more complexity onto web developers. Complexity which will be duplicated across sites. However there is another problem to consider here. Can switching collation on a objectStore or a unique index can affect its validity? I.e. if you switch from a case sensitive to a case insensitive collation, does that mean that if you have two entries with the primary keys Sweden and sweden they collide and thus the change of collation must result in an error (or aborted transaction)? I do seem to recall that there are ways to do at least case sensitivity such that you generally don't take case into account when sorting, unless two entries are exactly the same, in which case you do look at casing to differentiate them. However I don't really know a whole lot about this and so defer to people that know internationalization better. This is a good point. It makes me lean toward not allowing changing the collation of an index or store. That means we could just have an optional parameter (in the generic parameter object thingy we have now) on createObjectStore and createIndex that indicates the collation name. 
It seems minimally disruptive, it doesn't tax people that don't care about it, and since there is no setCollation we don't have the problem of not being able to re-index the data. So there is no way to specify things such that the collation doesn't affect unique-ness? If so, I tend to agree. Another piece of feedback I heard consistently as I discussed this with various folks at Microsoft is the need to be able to pick up what the UA would consider the collation that's most appropriate for the user environment (derived from settings, page language or whatever). We could support this by introducing a special value that you can pass to setCollation that indicates pick whatever is the right for the environment's language right now. Given that there is no other way for people to discover the user preference on this, I think this is pretty important. I would be fine with this as long as it's a explicit opt-in. There is definitely a risk that people will do this and then only do testing in one language, but it seems to me like a useful use case to support, and I don't see a way of supporting
RE: [IndexedDB] Spec changes for international language support
From: Jonas Sicking [mailto:jo...@sicking.cc] Sent: Friday, March 18, 2011 1:57 PM

However there is another problem to consider here. Can switching collation on an objectStore or a unique index affect its validity? I.e. if you switch from a case sensitive to a case insensitive collation, does that mean that if you have two entries with the primary keys Sweden and sweden they collide and thus the change of collation must result in an error (or aborted transaction)? I do seem to recall that there are ways to do at least case sensitivity such that you generally don't take case into account when sorting, unless two entries are exactly the same, in which case you do look at casing to differentiate them. However I don't really know a whole lot about this and so defer to people that know internationalization better.

This is a good point. It makes me lean toward not allowing changing the collation of an index or store. That means we could just have an optional parameter (in the generic parameter object thingy we have now) on createObjectStore and createIndex that indicates the collation name. It seems minimally disruptive, it doesn't tax people that don't care about it, and since there is no setCollation we don't have the problem of not being able to re-index the data.

So there is no way to specify things such that the collation doesn't affect uniqueness? If so, I tend to agree.

The problem is that different collations will consider different things unique. This is bound to be variable across languages and such, so I'm not sure we want to be in the business of fine-tuning this. It seems that being a bit more restrictive could result in a more robust result overall. If someone really needs to change the collation they can copy the table manually... not great, but if we think it's a corner case it's probably fine.
Another piece of feedback I heard consistently as I discussed this with various folks at Microsoft is the need to be able to pick up what the UA would consider the collation that's most appropriate for the user environment (derived from settings, page language or whatever). We could support this by introducing a special value that you can pass to setCollation that indicates pick whatever is the right for the environment's language right now. Given that there is no other way for people to discover the user preference on this, I think this is pretty important. I would be fine with this as long as it's a explicit opt-in. There is definitely a risk that people will do this and then only do testing in one language, but it seems to me like a useful use case to support, and I don't see a way of supporting this while completely avoiding the risk of internationalization bugs. I agree, it should be opt-in. I still assume we'll default to binary collation (same if you specify the collation value as null). I was reading the BCP 47 [1] and in section 4.1 Choice of Language Tag the item #7 seems to describe what we're looking for. The value i-default seems to match our needs close enough, so callers could use that value. Discoverability is not great, but we avoid having to specify something new, and arguably they'll need to read somewhere that this argument is a BCP47-compatible value, and we could put a comment about i-default right there. Sounds good to me. Though you seem to have forgotten to include the [1] reference. Oops, here it goes: [1] http://tools.ietf.org/html/bcp47
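To make the "optional parameter, binary by default" idea above concrete, here is a minimal sketch. It assumes today's ECMAScript Intl.Collator (which postdates this thread) as a stand-in for the UA's collation engine. The `collation` option and the `comparatorFor` helper are hypothetical names used only for illustration; they are not part of any IndexedDB draft, and handling of the `i-default` value is not shown.

```javascript
// Hypothetical: map a `collation` option value to a comparator.
// null/undefined -> binary (UTF-16 code unit) order, as assumed in the thread;
// anything else is treated as a BCP 47 tag and handed to Intl.Collator.
function comparatorFor(collation) {
  if (collation == null) {
    return (a, b) => (a < b ? -1 : a > b ? 1 : 0);
  }
  return new Intl.Collator(collation).compare;
}

// Hypothetical usage shape (the `collation` option is an assumption):
//   store.createIndex('byName', 'name', { unique: false, collation: 'en' });

const names = ['Zoe', 'adam', 'Bob'];
const binaryOrder = [...names].sort(comparatorFor(null));
const englishOrder = [...names].sort(comparatorFor('en'));
```

Because the collation is fixed when the store or index is created, nothing in this sketch ever needs to re-sort existing data, which is the property the no-setCollation design is after.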
Re: [IndexedDB] Spec changes for international language support
On 18 March 2011 19:29, Pablo Castro pablo.cas...@microsoft.com wrote: From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On Behalf Of Keean Schupke Sent: Friday, March 18, 2011 1:53 AM See my proposal in another thread. The basic idea is to copy BDB. Have a primary index that is based on an integer, something primitive and fast. Allow secondary indexes which use a callback to generate a binary index key. IDB shifts the complexity out into a library. Common use cases can be provided (a hash of all fields in the object, internationalised bidirectional lexicographic etc...), but the user is free to write their own for less usual cases (for example indexing by the last word in a name string to order by surname). I agree with Jeremy's comments on the other thread for this. Having the callback mechanism definitely sounds interesting but there are a ton of common cases that we can solve by just taking a language identifier, I'm not sure we want to make people work hard to get something that's already supported in most systems. The idea of having a callback to compute the index value feels incremental to this, so we could take on it later on without disrupting the explicit international collation stuff. The idea would be to provide pre-defined implementations of the callback for common use cases, then it is just as simple to register a callback as set any other option. All this means to the API is you pass a function instead of a string. It also is better for modularity as all the code relating to the sort order is kept in the callback functions. The difference comes down to something like: index.set_order_lexicographic('us'); vs index.set_order_method(order_lexicographic('us')); So more than just setting a property like the first case, where presumably all the ordering code is mixed in with the indexing code, the second case encapsulates all the ordering code in the function returned from the execution of order_lexicographic('us'). 
This function would represent a mapping from the object being indexed to a binary blob that is the actual stored index data. So doing it this way does not necessarily make things harder, and it improves the encapsulation, type safety, and flexibility of the API. Cheers, Keean.
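A rough sketch of the callback style described in this message, using the "order by surname" case mentioned earlier in the thread. Everything here is hypothetical: `orderBySurname` plays the role of something like `order_lexicographic('us')`, `ToyIndex` is an in-memory stand-in for an index that accepts an ordering callback, and the key is a plain string rather than the binary blob the proposal calls for.

```javascript
// Hypothetical key generator: returns a function that maps a record to its
// index key. Here the key is the last word of the given field, lowercased.
function orderBySurname(field) {
  return (record) => {
    const words = String(record[field]).trim().split(/\s+/);
    return words[words.length - 1].toLowerCase();
  };
}

// Toy stand-in for an index that takes the ordering callback at creation.
class ToyIndex {
  constructor(keyOf) {
    this.keyOf = keyOf;
    this.entries = [];
  }

  // Compute the key via the callback and keep entries in key order.
  put(record) {
    this.entries.push({ key: this.keyOf(record), record });
    this.entries.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  }

  records() {
    return this.entries.map((entry) => entry.record);
  }
}

const index = new ToyIndex(orderBySurname('name'));
index.put({ name: 'Pablo Castro' });
index.put({ name: 'Keean Schupke' });
index.put({ name: 'Jonas Sicking' });
const orderedNames = index.records().map((record) => record.name);
```

Note that the callback has to be supplied again every time the index is opened, which is the re-registration concern raised elsewhere in the thread.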
Re: [IndexedDB] Spec changes for international language support
FWIW, this maybe would have been better off as its own thread. :-)

On Thu, Mar 17, 2011 at 3:37 PM, Pablo Castro pablo.cas...@microsoft.com wrote:

From: Jonas Sicking [mailto:jo...@sicking.cc] Sent: Tuesday, March 08, 2011 1:11 PM

All in all, is there anything preventing adding the API Pablo suggests in this thread to the IndexedDB spec drafts?

I wanted to propose a couple of specific tweaks to the initial proposal and then, unless I hear pushback, start editing this into the spec. From reading the details on this thread I'm starting to realize that per-database collations won't do it. What did it for me was the example that has a fuzzier matching mode (case/accent insensitive). This is exactly the kind of index I would want to sort people's names in my address book, but most likely not the index I'll want to use for my primary key. Refactoring the API to accommodate this would mean moving the setCollation() method and the collation property to the object store and index objects. If we were willing to live without the ability to change them we could take collation as one of the optional parameters to createObjectStore()/createIndex() and reduce a bit of surface area... I don't have a strong preference there. In any case both would use BCP47 names as discussed in this thread (as Jonas pointed out, implementations can also do their thing as long as they don't interfere with BCP47).

I'm fine with this. Another (I believe) related use case I ran into today is wanting collation to be case insensitive.

Another piece of feedback I heard consistently as I discussed this with various folks at Microsoft is the need to be able to pick up what the UA would consider the collation that's most appropriate for the user environment (derived from settings, page language or whatever). We could support this by introducing a special value that you can pass to setCollation that indicates pick whatever is right for the environment's language right now.
Given that there is no other way for people to discover the user preference on this, I think this is pretty important. This seems useful even outside of the context of IndexedDB. It should probably be added to some other spec. I'm fine adding it to ours for now and adding an issue along with it. But if so, please do shop it around. J
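For what it's worth, the "pick whatever the environment prefers" behaviour discussed above can be sketched with today's ECMAScript Intl.Collator, which postdates this thread. Constructing it without a locale uses the runtime's default. The `collation` entry in the options bag below is a hypothetical name, and recording the resolved tag at creation time is an assumption of this sketch, not something the thread settles.

```javascript
// Passing undefined as the locale asks the runtime for its default locale.
// 'accent' sensitivity gives the case-insensitive behaviour mentioned above.
const environmentCollator = new Intl.Collator(undefined, { sensitivity: 'accent' });

// The language tag the runtime actually picked for this environment.
const resolvedLocale = environmentCollator.resolvedOptions().locale;

// Hypothetical options bag: store the resolved tag rather than "default",
// so the index order would not silently change if the user's settings do.
const indexOptions = { unique: false, collation: resolvedLocale };

// Case differences alone do not distinguish keys under this collator.
const ignoresCase = environmentCollator.compare('Sweden', 'sweden') === 0;
```

The result depends on the machine it runs on, which is exactly the testing risk raised in the thread: code exercised in only one language environment may behave differently in another.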
Re: [IndexedDB] Spec changes for international language support
2011/3/17 Pablo Castro pablo.cas...@microsoft.com:

From: Jonas Sicking [mailto:jo...@sicking.cc] Sent: Tuesday, March 08, 2011 1:11 PM

All in all, is there anything preventing adding the API Pablo suggests in this thread to the IndexedDB spec drafts?

I wanted to propose a couple of specific tweaks to the initial proposal and then, unless I hear pushback, start editing this into the spec. From reading the details on this thread I'm starting to realize that per-database collations won't do it. What did it for me was the example that has a fuzzier matching mode (case/accent insensitive). This is exactly the kind of index I would want to sort people's names in my address book, but most likely not the index I'll want to use for my primary key. Refactoring the API to accommodate this would mean moving the setCollation() method and the collation property to the object store and index objects. If we were willing to live without the ability to change them we could take collation as one of the optional parameters to createObjectStore()/createIndex() and reduce a bit of surface area...

Unfortunately I think you bring up good use cases for per-objectStore/index collations. It's definitely tempting to just add it as an optional parameter to createObjectStore/createIndex. The downside is obviously pushing more complexity onto web developers. Complexity which will be duplicated across sites.

However there is another problem to consider here. Can switching collation on an objectStore or a unique index affect its validity? I.e. if you switch from a case sensitive to a case insensitive collation, does that mean that if you have two entries with the primary keys Sweden and sweden they collide and thus the change of collation must result in an error (or aborted transaction)?
I do seem to recall that there are ways to do at least case sensitivity such that you generally don't take case into account when sorting, unless two entries are exactly the same, in which case you do look at casing to differentiate them. However I don't really know a whole lot about this and so defer to people that know internationalization better. I don't have a strong preference there. In any case both would use BCP47 names as discussed in this thread (as Jonas pointed out, implementations can also do their thing as long as they don't interfere with BCP47). Another piece of feedback I heard consistently as I discussed this with various folks at Microsoft is the need to be able to pick up what the UA would consider the collation that's most appropriate for the user environment (derived from settings, page language or whatever). We could support this by introducing a special value that you can pass to setCollation that indicates pick whatever is the right for the environment's language right now. Given that there is no other way for people to discover the user preference on this, I think this is pretty important. I would be fine with this as long as it's a explicit opt-in. There is definitely a risk that people will do this and then only do testing in one language, but it seems to me like a useful use case to support, and I don't see a way of supporting this while completely avoiding the risk of internationalization bugs. / Jonas
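The Sweden/sweden question raised in this message can be illustrated with a small sketch. It assumes today's ECMAScript Intl.Collator (which postdates this thread) as a stand-in for whatever collation engine a UA would use; `findCollisions` is a hypothetical helper, not a proposed API.

```javascript
// Hypothetical helper: returns pairs of keys that compare as equal under the
// given comparator, i.e. keys that would violate a uniqueness constraint.
function findCollisions(keys, compare) {
  const sorted = [...keys].sort(compare);
  const collisions = [];
  for (let i = 1; i < sorted.length; i++) {
    if (compare(sorted[i - 1], sorted[i]) === 0) {
      collisions.push([sorted[i - 1], sorted[i]]);
    }
  }
  return collisions;
}

// 'variant' sensitivity distinguishes case and accents (case sensitive).
const caseSensitive = new Intl.Collator('en', { sensitivity: 'variant' });
// 'accent' sensitivity ignores case but still distinguishes accents.
const caseInsensitive = new Intl.Collator('en', { sensitivity: 'accent' });

const primaryKeys = ['Sweden', 'sweden', 'Norway'];

const collisionsBefore = findCollisions(primaryKeys, caseSensitive.compare);
const collisionsAfter = findCollisions(primaryKeys, caseInsensitive.compare);
```

Under the case-sensitive collator the three keys are distinct; under the case-insensitive one, Sweden and sweden compare as equal, so a store that switched collations would have to reject the change or drop a record.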
Re: [IndexedDB] Spec changes for international language support
2011/2/23 Pablo Castro pablo.cas...@microsoft.com: From: jungs...@google.com [mailto:jungs...@google.com] On Behalf Of Jungshik Shin (???, ???) Sent: Tuesday, February 22, 2011 2:08 PM On Fri, Feb 18, 2011 at 2:34 AM, Bjoern Hoehrmann derhoe...@gmx.net wrote: * Pablo Castro wrote: We discussed international language support last time at the TPAC and I said I'd propose spec text for it. Please find the patch below, the changes mirror exactly the proposal described in the bug we have for tracking this: http://www.w3.org/Bugs/Public/show_bug.cgi?id=9903 You should anticipate objections to that; collation is not a property of language, for instance, for de-de you typically have dictionary sorting and phone book sorting (and of course you have de-de, de-ch, and so on, so de alone would be rather meaningless). So far the W3C and the IETF have used resource identifiers to specify collations (see XPath 2.0 and RFC 4790) where the IETF allows shorthands like i;ascii-casemap. I agree that simply specifying that 'language' be used without saying what it means is not sufficient. However, your examples (German phonebook vs dictionary) can be covered with language identifier framework laid out in BCP47 (with 'u' extension). Fair enough. I'll adjust this part of the write up to discuss this in terms of collation identifier or language identifier. I do understand that Microsoft uses an extension of language tags for the `CultureInfo` in the .NET Framework, where, say, `de-DE_phoneb` is used to refer to german phone book sorting, but BCP 47 does not allow for that, There's a way to specify alternate sorting orders (e.g. German phonebook, Chinese pinyin, stroke count, radical-stroke count order, etc) under the BCP 47 framework because it has a mechanism for defining an extension and registering it. The Unicode consortium uses that mechanism to define 'u' extension and a set of subtags that can be used with 'u'. 
For instance, German phonebook sorting can be identified with 'de-DE-u-co-phonebk'. See https://tools.ietf.org/html/bcp47 https://tools.ietf.org/html/rfc6067 http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers Also, see Bug 9903 comment 6 by Mark Davis for more examples. Well, I'm just copying his comment directly here: To add to what Jungshik said, BCP47 defines standard extensions. The extension defined by the Unicode consortium (http://cldr.unicode.org/index/bcp47-extension) provides for fine-grained specifications of collation behavior. Examples for German:
  de-u-co-phonebk // phonebook order
  de-u-kn-true // numeric sorting, eg Tom2 comes before Tom12
  de-u-ks-level1 // ignore accents, case differences
  de-u-ks-level2 // ignore case differences
  de-u-ks-level1-kc-true // ignore accents, but not case
These can be combined, such as: de-u-co-phonebk-kn-true-ks-level1-kc-true
neither could you devise a language tag to define something like i;ascii-casemap (which simply defines A-Z = a-z). I'm not sure how specific we want to get into this. In particular, would it be better if we specified it all the way (including which extensions UAs need to support) or if we used BCP47 as the starting point and allowed UAs to support additional extensions as needed? I think for now we should allow implementations to support additional collations in addition to whatever set we specify. It seems to me that this is an area that is heavily in flux and I'd hate to paint ourselves into a corner. I would expect that if browsers offer collations, there would be an interface for that so you can use them in other places, as such it might be wiser to accept something other than a language identifier string.
There's an on-going effort to expose a 'rich' set of I18N API to client-side development using Javascript ( http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api : The API used to be much more extensive than now, but has been scaled down significantly to get more browsers on board in its 1st iteration). There we're likely to use BCP 47 with 'u' extension (see above). So, I think it'd be better if IndexedDB matches what ECMAScript plans to do. This is interesting, do you know how far along is this? And does someone have a link to drafts? I suspect we don't want to wait for this work to finish, but we should definitely track it and seek inspiration. And there are probably people there that can review whatever we're doing. I also note that collation often involves equivalence testing, but it is not clear from your proposal whether that is the case here. It might also be a good idea to clearly spell out interoperability expectations; if two implementations support some collation, will they behave the same for any and all inputs as far as collation is concerned, or should one be prepared for slight differences among implementations? I think it's more practical to assume that users should be prepared for
RE: [IndexedDB] Spec changes for international language support
From: jungs...@google.com [mailto:jungs...@google.com] On Behalf Of Jungshik Shin (???, ???) Sent: Tuesday, February 22, 2011 2:08 PM On Fri, Feb 18, 2011 at 2:34 AM, Bjoern Hoehrmann derhoe...@gmx.net wrote: * Pablo Castro wrote: We discussed international language support last time at the TPAC and I said I'd propose spec text for it. Please find the patch below, the changes mirror exactly the proposal described in the bug we have for tracking this: http://www.w3.org/Bugs/Public/show_bug.cgi?id=9903 You should anticipate objections to that; collation is not a property of language, for instance, for de-de you typically have dictionary sorting and phone book sorting (and of course you have de-de, de-ch, and so on, so de alone would be rather meaningless). So far the W3C and the IETF have used resource identifiers to specify collations (see XPath 2.0 and RFC 4790) where the IETF allows shorthands like i;ascii-casemap. I agree that simply specifying that 'language' be used without saying what it means is not sufficient. However, your examples (German phonebook vs dictionary) can be covered with language identifier framework laid out in BCP47 (with 'u' extension). Fair enough. I'll adjust this part of the write up to discuss this in terms of collation identifier or language identifier. I do understand that Microsoft uses an extension of language tags for the `CultureInfo` in the .NET Framework, where, say, `de-DE_phoneb` is used to refer to german phone book sorting, but BCP 47 does not allow for that, There's a way to specify alternate sorting orders (e.g. German phonebook, Chinese pinyin, stroke count, radical-stroke count order, etc) under the BCP 47 framework because it has a mechanism for defining an extension and registering it. The Unicode consortium uses that mechanism to define 'u' extension and a set of subtags that can be used with 'u'. For instance, German phonebook sorting can be identified with 'de-DE-u-co-phonebk'. 
See https://tools.ietf.org/html/bcp47 https://tools.ietf.org/html/rfc6067 http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers Also, see Bug 9903 comment 6 by Mark Davis for more examples. Well, I'm just copying his comment directly here: To add to what Jungshik said, BCP47 defines standard extensions. The extension defined by the Unicode consortium (http://cldr.unicode.org/index/bcp47-extension) provides for fine-grained specifications of collation behavior. Examples for German:
  de-u-co-phonebk // phonebook order
  de-u-kn-true // numeric sorting, eg Tom2 comes before Tom12
  de-u-ks-level1 // ignore accents, case differences
  de-u-ks-level2 // ignore case differences
  de-u-ks-level1-kc-true // ignore accents, but not case
These can be combined, such as: de-u-co-phonebk-kn-true-ks-level1-kc-true
neither could you devise a language tag to define something like i;ascii-casemap (which simply defines A-Z = a-z). I'm not sure how specific we want to get into this. In particular, would it be better if we specified it all the way (including which extensions UAs need to support) or if we used BCP47 as the starting point and allowed UAs to support additional extensions as needed? I would expect that if browsers offer collations, there would be an interface for that so you can use them in other places, as such it might be wiser to accept something other than a language identifier string. There's an on-going effort to expose a 'rich' set of I18N API to client-side development using Javascript ( http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api : The API used to be much more extensive than now, but has been scaled down significantly to get more browsers on board in its 1st iteration). There we're likely to use BCP 47 with 'u' extension (see above). So, I think it'd be better if IndexedDB matches what ECMAScript plans to do. This is interesting, do you know how far along is this?
I also note that collation often involves equivalence testing, but it is not clear from your proposal whether that is the case here. It might also be a good idea to clearly spell out interoperability expectations; if two implementations support some collation, will they behave the same for any and all inputs as far as collation is concerned, or should one be prepared for slight differences among implementations? I think it's more practical to assume that users should be prepared for slight differences among implementations. Thanks -pablo
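As a rough illustration of the BCP 47 'u' extension tags quoted above, here is how they behave in the ECMAScript Internationalization API (`Intl.Collator`) that eventually grew out of the i18n strawman. This is a sketch under assumptions, not the IndexedDB proposal itself: `Intl.Collator` reads the `co` and `kn` keys from the tag but not `ks`/`kc`, so the strength examples are approximated with its `sensitivity` option. The variable names are illustrative.

```typescript
// 'kn-true' requests numeric ordering through the tag itself
// (the thread's example: Tom2 should come before Tom12).
const numericGerman = new Intl.Collator("de-u-kn-true");
const sortedNames = ["Tom12", "Tom2", "Tom1"].sort(numericGerman.compare);

// 'co-phonebk' requests German phonebook order. Which collation the
// runtime actually honoured is reported by resolvedOptions().
const phonebook = new Intl.Collator("de-u-co-phonebk");
const resolvedCollation = phonebook.resolvedOptions().collation;

// Intl.Collator does not read 'ks'/'kc' from the tag; the closest
// equivalents are expressed through the `sensitivity` option instead.
const ignoreAccentsAndCase = new Intl.Collator("de", { sensitivity: "base" }); // roughly ks-level1
const ignoreCaseOnly = new Intl.Collator("de", { sensitivity: "accent" }); // roughly ks-level2
const ignoreAccentsOnly = new Intl.Collator("de", { sensitivity: "case" }); // roughly ks-level1-kc-true

console.log(sortedNames, resolvedCollation);
console.log(ignoreAccentsAndCase.compare("a", "Ä")); // 0 means "treated as equal"
```

Checking `resolvedOptions()` is one way a page could detect that an implementation silently fell back to a different collation, which bears on the interoperability question raised in the thread.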
Re: [IndexedDB] Spec changes for international language support
On Fri, Feb 18, 2011 at 2:34 AM, Bjoern Hoehrmann derhoe...@gmx.net wrote: * Pablo Castro wrote: We discussed international language support last time at the TPAC and I said I'd propose spec text for it. Please find the patch below, the changes mirror exactly the proposal described in the bug we have for tracking this: http://www.w3.org/Bugs/Public/show_bug.cgi?id=9903 You should anticipate objections to that; collation is not a property of language, for instance, for de-de you typically have dictionary sorting and phone book sorting (and of course you have de-de, de-ch, and so on, so de alone would be rather meaningless). So far the W3C and the IETF have used resource identifiers to specify collations (see XPath 2.0 and RFC 4790) where the IETF allows shorthands like i;ascii-casemap. I agree that simply specifying that 'language' be used without saying what it means is not sufficient. However, your examples (German phonebook vs dictionary) can be covered with language identifier framework laid out in BCP47 (with 'u' extension). I do understand that Microsoft uses an extension of language tags for the `CultureInfo` in the .NET Framework, where, say, `de-DE_phoneb` is used to refer to german phone book sorting, but BCP 47 does not allow for that, There's a way to specify alternate sorting orders (e.g. German phonebook, Chinese pinyin, stroke count, radical-stroke count order, etc) under the BCP 47 framework because it has a mechanism for defining an extension and registering it. The Unicode consortium uses that mechanism to define 'u' extension and a set of subtags that can be used with 'u'. For instance, German phonebook sorting can be identified with 'de-DE-u-co-phonebk'. 
See https://tools.ietf.org/html/bcp47 https://tools.ietf.org/html/rfc6067 http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers Also, see Bug 9903 comment 6 by Mark Davis (http://www.w3.org/Bugs/Public/show_bug.cgi?id=9903#c6) for more examples.
Well, I'm just copying his comment directly here: To add to what Jungshik said, BCP47 defines standard extensions. The extension defined by the Unicode consortium (http://cldr.unicode.org/index/bcp47-extension) provides for fine-grained specifications of collation behavior. Examples for German:
  de-u-co-phonebk // phonebook order
  de-u-kn-true // numeric sorting, eg Tom2 comes before Tom12
  de-u-ks-level1 // ignore accents, case differences
  de-u-ks-level2 // ignore case differences
  de-u-ks-level1-kc-true // ignore accents, but not case
These can be combined, such as: de-u-co-phonebk-kn-true-ks-level1-kc-true
neither could you devise a language tag to define something like i;ascii-casemap (which simply defines A-Z = a-z). I would expect that if browsers offer collations, there would be an interface for that so you can use them in other places, as such it might be wiser to accept something other than a language identifier string. There's an on-going effort to expose a 'rich' set of I18N API to client-side development using Javascript ( http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api : The API used to be much more extensive than now, but has been scaled down significantly to get more browsers on board in its 1st iteration). There we're likely to use BCP 47 with 'u' extension (see above). So, I think it'd be better if IndexedDB matches what ECMAScript plans to do. Jungshik As above, URIs, or RFC 4790 values plus URIs, or, in anticipation of some such interface, some other object, might be a better choice. And the method and attribute should probably not use language in their names. I also note that collation often involves equivalence testing, but it is not clear from your proposal whether that is the case here.
It might also be a good idea to clearly spell out interoperability expectations; if two implementations support some collation, will they behave the same for any and all inputs as far as collation is concerned, or should one be prepared for slight differences among implementations? -- Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Re: [IndexedDB] Spec changes for international language support
* Pablo Castro wrote: We discussed international language support last time at the TPAC and I said I'd propose spec text for it. Please find the patch below, the changes mirror exactly the proposal described in the bug we have for tracking this: http://www.w3.org/Bugs/Public/show_bug.cgi?id=9903 You should anticipate objections to that; collation is not a property of language, for instance, for de-de you typically have dictionary sorting and phone book sorting (and of course you have de-de, de-ch, and so on, so de alone would be rather meaningless). So far the W3C and the IETF have used resource identifiers to specify collations (see XPath 2.0 and RFC 4790) where the IETF allows shorthands like i;ascii-casemap. I do understand that Microsoft uses an extension of language tags for the `CultureInfo` in the .NET Framework, where, say, `de-DE_phoneb` is used to refer to german phone book sorting, but BCP 47 does not allow for that, neither could you devise a language tag to define something like i;ascii-casemap (which simply defines A-Z = a-z). I would expect that if browsers offer collations, there would be an interface for that so you can use them in other places, as such it might be wiser to accept something other than a language identifier string. As above, URIs, or RFC 4790 values plus URIs, or, in anticipation of some such interface, some other object, might be a better choice. And the method and attribute should probably not use language in their names. I also note that collation often involves equivalence testing, but it is not clear from your proposal whether that is the case here. It might also be a good idea to clearly spell out interoperability expectations; if two implementations support some collation, will they behave the same for any and all inputs as far as collation is concerned, or should one be prepared for slight differences among implementations? 
-- Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
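To make the `i;ascii-casemap` remark concrete ("simply defines A-Z = a-z"), the following is a minimal sketch of such a comparison: only the ASCII letters a-z are folded, and everything else is compared by code point. It illustrates the idea and is not a conformant RFC 4790 implementation; the function names `asciiUpper` and `asciiCasemapCompare` are made up for this example.

```typescript
// Fold only ASCII a-z to A-Z; non-ASCII letters such as 'ä' are left alone.
function asciiUpper(s: string): string {
  return s.replace(/[a-z]/g, (c) => c.toUpperCase());
}

// Compare two strings after ASCII case folding, by Unicode code point.
// Code point order matches the byte order of the UTF-8 encodings.
function asciiCasemapCompare(a: string, b: string): number {
  const ua = asciiUpper(a);
  const ub = asciiUpper(b);
  let i = 0;
  while (i < ua.length && i < ub.length) {
    const ca = ua.codePointAt(i)!;
    const cb = ub.codePointAt(i)!;
    if (ca !== cb) return ca < cb ? -1 : 1;
    // Equal code points: skip two UTF-16 units for astral characters.
    i += ca > 0xffff ? 2 : 1;
  }
  // One string is a prefix of the other; the shorter one sorts first.
  if (ua.length === ub.length) return 0;
  return ua.length < ub.length ? -1 : 1;
}

console.log(asciiCasemapCompare("Hello", "hELLO")); // equal under this folding
console.log(asciiCasemapCompare("ä", "Ä")); // not equal: non-ASCII is not folded
```

A comparison like this depends on no locale at all, which is why it cannot be named by a language tag.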
Re: [IndexedDB] Spec changes for international language support
Hi Pablo, I will reassign this bug to Eliott. Nikunj On Feb 17, 2011, at 6:38 PM, Pablo Castro wrote: btw - the bug is assigned to Nikunj right now but I think that's just because of an editing glitch. Nikunj please let me know if you were working on it, otherwise I'll just submit the changes once I hear some feedback from this group.