RE: [IndexedDB] Languages for collation
From: Jungshik Shin (신정식, 申政湜) [mailto:jungs...@google.com]
Sent: Tuesday, August 24, 2010 10:34 PM

As for the locale identifiers, my understanding is that Windows APIs (newer 'name-based' locale APIs) more or less follow BCP 47.

Picking this back up from this August thread. I went around and asked Windows folks about this. Locale identifiers based on BCP 47 sound good. On the other hand, we probably wouldn't do UCA. I heard various worries from folks that work in this space, including the fact that it seems it's still changing so it would be a moving target (which btw means that collisions could still happen) and that we don't support it in a number of places today.

Given that feedback, I would rather leave this open and let implementations choose the algorithm for collation (still need to do language-sensitive collation, of course). Would that work?

Thanks
-pablo
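For readers who want something concrete, here is a minimal sketch of the split suggested above: BCP 47 tags identify the language, and the collation algorithm itself is left to the implementation. It uses the ECMAScript `Intl.Collator` API, which was standardized after this thread and is not part of IndexedDB; `sortForLanguage` is a hypothetical helper name used only for illustration.

```typescript
// Sketch only: Intl.Collator stands in for "whatever language-sensitive
// collation the implementation provides". It is not the IndexedDB API.
function sortForLanguage(keys: readonly string[], languageTag: string): string[] {
  // The tag is a BCP 47 identifier; the ordering rules behind it are
  // chosen by the engine, not by the caller.
  const collator = new Intl.Collator(languageTag);
  return keys.slice().sort(collator.compare);
}

// The engine reports which locale it actually resolved the tag to.
console.log(new Intl.Collator("en").resolvedOptions().locale);
console.log(sortForLanguage(["banana", "cherry", "apple"], "en"));
```

The caller only ever names a language; two engines may order edge cases differently, which is exactly the trade-off being discussed.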
Re: [IndexedDB] Languages for collation
On Tue, Aug 17, 2010 at 12:02 AM, Jonas Sicking jo...@sicking.cc wrote: On Mon, Aug 16, 2010 at 2:20 AM, Jeremy Orlow jor...@chromium.org wrote: However I think it's very rare that this will be needed. And there are ways to somewhat work around it by using separate databases. So I would probably say that lets keep it database-wide for now, and reconsider in version 2. On the other hand, is there any reason not to make it per-objectStore/index? As far as I can tell, it should actually be fairly light weight form an API point of view: we can just add it as an optional parameter to createObjectStore/createIndex. From an implementation point of view, I really don't see this being much overhead either. So maybe we should just do it? I don't feel very strongly. Though I'd want to check that this is actually pretty easy to do implementation wise. Given that I think this is a low-value feature, I'd want to make sure it's low-cost too. How will we check? And should we really be basing decisions off of what's easiest to do implementation wise? And is this truly a low value feature? By check I meant talk to Ben and Shawn who actually knows how our implementation works in detail. So the result is that in our current architecture we can't support different collations for different objectStores. Come to think of it, it's the same for us. But that's not to say that it couldn't be done another way. And implementation should be a very minor worry for us. But given that we think multiple databases will be a good work around, I'm fine sticking with a per-database setting as Pablo originally proposed. We can support changing collation in an existing database though. It will be a very slow operation, but it's needed to avoid forcing authors to delete an existing database and recreate a new one with a new collation. By low value I mean that no one has presented a use case that requires it. 
The alternative is to add a function within setVersion to set the language which actually seems less elegant. I don't understand what you mean by this. Have a setLanguage method on IDBDatabase that can only be called from within a setVersion transaction. In the same way removeObjectStore and company can only be called within a setVersion transaction. That would work. So effectively this function would modify all the data in all the objectStores and indexes such that it's now sorted according to the new collation. The 'success' event is fired after all data has been updated. Any requests made after the setLanguage call will see the modified data. Is that the idea? I'm not married to any of the particulars, but yeah that is the general idea.
Re: [IndexedDB] Languages for collation
On Tue, Aug 17, 2010 at 12:37 AM, Jungshik Shin (신정식, 申政湜) jungs...@google.com wrote:

+ adding the authors of BCP 47 (Mark Davis and Addison Phillips) and Richard Ishida (w3c i18n)

On Mon, Aug 16, 2010 at 4:03 PM, Jonas Sicking jo...@sicking.cc wrote:
On Mon, Aug 16, 2010 at 10:11 AM, Jeremy Orlow jor...@chromium.org wrote:

2 additional questions: What standard will define the language codes and the associated collation algorithm?

Very good questions. Are there specifications for this stuff elsewhere?

As for the language code, we already have BCP 47. See
http://www.rfc-editor.org/rfc/bcp/bcp47.txt
The Registry: http://www.iana.org/assignments/language-subtag-registry
http://unicode.org/reports/tr35/#BCP47

The collation algorithm should be based on UCA (http://unicode.org/reports/tr10/) with locale-specific tailoring coming from CLDR (http://cldr.unicode.org).

And what's the behavior for an implementation that doesn't support that particular language?

http://unicode.org/reports/tr35/#BCP47
BCP 47 above defines a truncation/fallback mechanism. If all the locales along the line of truncation/fallback fail, it'd eventually fall back to the UCA.

Jungshik

/ Jonas

Thanks for the response, Jungshik! Referencing this stuff looks good for the spec side of things. Do you know anything about the implementation side, by chance? In other words, are there any standard libraries that we can use for all of this? (Ideally BSD, LGPL, or similarly licensed? :-)

J
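The truncation/fallback idea mentioned above can be sketched roughly as follows. This is a simplified take on progressive truncation of a language tag (dropping subtags from the end, and dropping a dangling single-character subtag along with its trailing subtag), not a complete implementation of any specification. The names `fallbackChain` and `pickLocale`, and the `"root"` label for the untailored UCA ordering, are assumptions made for this example.

```typescript
// Produce the list of progressively shorter tags to try, most specific first.
function fallbackChain(tag: string): string[] {
  const subtags = tag.split("-").filter((s) => s.length > 0);
  const chain: string[] = [];
  while (subtags.length > 0) {
    chain.push(subtags.join("-"));
    subtags.pop();
    // A single-character subtag (such as "x") cannot end a tag, so it is
    // removed together with the subtag that followed it.
    while (subtags.length > 0 && subtags[subtags.length - 1].length === 1) {
      subtags.pop();
    }
  }
  return chain;
}

// Pick the first candidate the implementation supports; otherwise fall back
// to the untailored ordering. `supported` is assumed to hold lowercase tags.
function pickLocale(requested: string, supported: readonly string[], root: string = "root"): string {
  for (const candidate of fallbackChain(requested)) {
    if (supported.indexOf(candidate.toLowerCase()) !== -1) {
      return candidate;
    }
  }
  return root;
}

console.log(fallbackChain("sv-SE"));              // [ 'sv-SE', 'sv' ]
console.log(pickLocale("sv-SE", ["en", "sv"]));   // sv
console.log(pickLocale("tlh-Latn", ["en", "sv"])); // root
```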
Re: [IndexedDB] Languages for collation
On Mon, Aug 16, 2010 at 12:09 AM, Jonas Sicking jo...@sicking.cc wrote: On Fri, Aug 13, 2010 at 12:15 PM, Jeremy Orlow jor...@chromium.org wrote: On Fri, Aug 13, 2010 at 5:02 PM, Jonas Sicking jo...@sicking.cc wrote: On Fri, Aug 13, 2010 at 4:56 AM, Jeremy Orlow jor...@chromium.org wrote: On Fri, Aug 13, 2010 at 1:31 AM, Pablo Castro pablo.cas...@microsoft.com wrote: From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy Orlow Sent: Thursday, August 12, 2010 2:18 AM I think we should first break down the use cases and look at how many of them just need _a_ sort order, how many of them a per-database sort order is ok, and how many of them would need something finer grained (like a per-key ordering). That's reasonable. What I was thinking is that any case where you'll use the order of items in a store/index to display things to the user (e.g. a list of contacts) you'd want the items to be in proper order for the user's language. That will not only match users' expectations but also match other applications (or even other parts of the UA) that display data based on the current OS language or the users' choice of language. That covers a very broad spectrum of scenarios that need language-specific sort order. I find it unlikely that a single web app will need more than one language per database (or even per origin/OS account), given that most applications operate in a single language at any one point in time. A lot of people are multi-lingual and I'm sure there will be at least some apps that need different data sorted in different ways for each language used. It's quite likely that such apps could use multiple databases as a work-around though. (As long as they don't need to execute transactions between them.) I can give some input as a multi-lingual person here. The only time I've used multiple languages at the same time in an application is for spell checking. 
In my browser I sometimes end up with setting the language in one textbox to swedish, and another to english. It's often annoying how poorly this use case is supported in applications actually. However I've never been in a situation where I've wanted some lists sorted in swedish and some in english. Possibly you would want to have spelling suggestions for a swedish textbox sorted in swedish order, and spelling suggestions for an english textbox sorted in english order. Though I think it wouldn't be much problem to have the different dictionaries in different databases. From an API point of view I think it would be pretty easy to support setting collation for individual objectStores. All we'd need is something like: interface IDBObjectStore { ... IDBRequest setSortingLanguage(in DOMString languageCode); IDBRequest getSortingLanguage(); ... }; To call setSortingLanguage you'd need READ_WRITE access. It acts just like any other writing request, with the only difference that it can take a lng time to execute. We could even add these functions to IDBIndex to allow the same data to be sorted in different ways at the same time. Why not put it behind setVersion and just make it an optional parameter when creating objectStores and indexes? I agree with Pablo that these things really shouldn't be changing much--in fact, maybe it's not worth making them modifiable at all (without rebuilding a new objectStore/index yourself). What is the advantage of this approach? It seems more cumbersome for authors. It brings back memories of the days when you had to recreate a SQL table to add a column to it. The advantage is that the API is more clear from a syntactic and performance impact standpoint. If you felt strongly, we could add a modifyObjectStore/modifyIndex method, but I don't think it's necessary. However I think it's very rare that this will be needed. And there are ways to somewhat work around it by using separate databases. 
So I would probably say that let's keep it database-wide for now, and reconsider in version 2. On the other hand, is there any reason not to make it per-objectStore/index? As far as I can tell, it should actually be fairly lightweight from an API point of view: we can just add it as an optional parameter to createObjectStore/createIndex. From an implementation point of view, I really don't see this being much overhead either. So maybe we should just do it? I don't feel very strongly. Though I'd want to check that this is actually pretty easy to do implementation-wise. Given that I think this is a low-value feature, I'd want to make sure it's low-cost too. How will we check? And should we really be basing decisions off of what's easiest to do implementation-wise? And is this truly a low-value feature? The alternative is to add a function
Re: [IndexedDB] Languages for collation
On Mon, Aug 16, 2010 at 2:20 AM, Jeremy Orlow jor...@chromium.org wrote: However I think it's very rare that this will be needed. And there are ways to somewhat work around it by using separate databases. So I would probably say that lets keep it database-wide for now, and reconsider in version 2. On the other hand, is there any reason not to make it per-objectStore/index? As far as I can tell, it should actually be fairly light weight form an API point of view: we can just add it as an optional parameter to createObjectStore/createIndex. From an implementation point of view, I really don't see this being much overhead either. So maybe we should just do it? I don't feel very strongly. Though I'd want to check that this is actually pretty easy to do implementation wise. Given that I think this is a low-value feature, I'd want to make sure it's low-cost too. How will we check? And should we really be basing decisions off of what's easiest to do implementation wise? And is this truly a low value feature? By check I meant talk to Ben and Shawn who actually knows how our implementation works in detail. So the result is that in our current architecture we can't support different collations for different objectStores. We can support changing collation in an existing database though. It will be a very slow operation, but it's needed to avoid forcing authors to delete an existing database and recreate a new one with a new collation. By low value I mean that no one has presented a use case that requires it. The alternative is to add a function within setVersion to set the language which actually seems less elegant. I don't understand what you mean by this. Have a setLanguage method on IDBDatabase that can only be called from within a setVersion transaction. In the same way removeObjectStore and company can only be called within a setVersion transaction. That would work. 
So effectively this function would modify all the data in all the objectStores and indexes such that it's now sorted according to the new collation. The 'success' event is fired after all data has been updated. Any requests made after the setLanguage call will see the modified data. Is that the idea? / Jonas
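To make the semantics described above easier to picture, here is a toy in-memory model. It is only a sketch: `ToyDatabase`, its synchronous `setVersion` callback, and the injectable `comparatorFor` factory are all hypothetical and do not reflect any real IndexedDB implementation or the spec text. It shows the three points from the discussion: `setLanguage` is rejected outside a `setVersion` transaction, it re-sorts every store, and success is reported only after all data has been updated.

```typescript
type Comparator = (a: string, b: string) => number;

class ToyDatabase {
  private stores: { [name: string]: string[] } = {};
  private inSetVersion = false;

  constructor(
    public language: string,
    // Assumption: collation comes from Intl.Collator unless a factory is injected.
    private comparatorFor: (language: string) => Comparator =
      (language) => new Intl.Collator(language).compare
  ) {}

  put(store: string, key: string): void {
    const keys = this.stores[store] || (this.stores[store] = []);
    keys.push(key);
    keys.sort(this.comparatorFor(this.language));
  }

  keys(store: string): string[] {
    return (this.stores[store] || []).slice();
  }

  // Stand-in for a setVersion transaction: schema-level calls are only
  // legal while `work` is running.
  setVersion(work: (db: ToyDatabase) => void): void {
    this.inSetVersion = true;
    try {
      work(this);
    } finally {
      this.inSetVersion = false;
    }
  }

  setLanguage(language: string, onSuccess?: () => void): void {
    if (!this.inSetVersion) {
      throw new Error("setLanguage may only be called inside setVersion");
    }
    const compare = this.comparatorFor(language);
    // Potentially slow: every store is re-sorted under the new collation.
    for (const name of Object.keys(this.stores)) {
      this.stores[name].sort(compare);
    }
    this.language = language;
    if (onSuccess) {
      onSuccess(); // fired only after all data has been updated
    }
  }
}

const db = new ToyDatabase("en");
db.put("contacts", "Zoe");
db.put("contacts", "adam");
db.setVersion((d) => d.setLanguage("en-GB", () => console.log("success", d.keys("contacts"))));
```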
Re: [IndexedDB] Languages for collation
On Mon, Aug 16, 2010 at 10:11 AM, Jeremy Orlow jor...@chromium.org wrote:

2 additional questions: What standard will define the language codes and the associated collation algorithm? And what's the behavior for an implementation that doesn't support that particular language?

Very good questions. Are there specifications for this stuff elsewhere?

/ Jonas
Re: [IndexedDB] Languages for collation
On Fri, Aug 13, 2010 at 12:15 PM, Jeremy Orlow jor...@chromium.org wrote: On Fri, Aug 13, 2010 at 5:02 PM, Jonas Sicking jo...@sicking.cc wrote: On Fri, Aug 13, 2010 at 4:56 AM, Jeremy Orlow jor...@chromium.org wrote: On Fri, Aug 13, 2010 at 1:31 AM, Pablo Castro pablo.cas...@microsoft.com wrote: From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy Orlow Sent: Thursday, August 12, 2010 2:18 AM I think we should first break down the use cases and look at how many of them just need _a_ sort order, how many of them a per-database sort order is ok, and how many of them would need something finer grained (like a per-key ordering). That's reasonable. What I was thinking is that any case where you'll use the order of items in a store/index to display things to the user (e.g. a list of contacts) you'd want the items to be in proper order for the user's language. That will not only match users' expectations but also match other applications (or even other parts of the UA) that display data based on the current OS language or the users' choice of language. That covers a very broad spectrum of scenarios that need language-specific sort order. I find it unlikely that a single web app will need more than one language per database (or even per origin/OS account), given that most applications operate in a single language at any one point in time. A lot of people are multi-lingual and I'm sure there will be at least some apps that need different data sorted in different ways for each language used. It's quite likely that such apps could use multiple databases as a work-around though. (As long as they don't need to execute transactions between them.) I can give some input as a multi-lingual person here. The only time I've used multiple languages at the same time in an application is for spell checking. In my browser I sometimes end up with setting the language in one textbox to swedish, and another to english. 
It's often annoying how poorly this use case is supported in applications actually. However I've never been in a situation where I've wanted some lists sorted in swedish and some in english. Possibly you would want to have spelling suggestions for a swedish textbox sorted in swedish order, and spelling suggestions for an english textbox sorted in english order. Though I think it wouldn't be much problem to have the different dictionaries in different databases. From an API point of view I think it would be pretty easy to support setting collation for individual objectStores. All we'd need is something like: interface IDBObjectStore { ... IDBRequest setSortingLanguage(in DOMString languageCode); IDBRequest getSortingLanguage(); ... }; To call setSortingLanguage you'd need READ_WRITE access. It acts just like any other writing request, with the only difference that it can take a lng time to execute. We could even add these functions to IDBIndex to allow the same data to be sorted in different ways at the same time. Why not put it behind setVersion and just make it an optional parameter when creating objectStores and indexes? I agree with Pablo that these things really shouldn't be changing much--in fact, maybe it's not worth making them modifiable at all (without rebuilding a new objectStore/index yourself). What is the advantage of this approach? It seems more cumbersome for authors. It brings back memories of the days when you had to recreate a SQL table to add a column to it. However I think it's very rare that this will be needed. And there are ways to somewhat work around it by using separate databases. So I would probably say that lets keep it database-wide for now, and reconsider in version 2. On the other hand, is there any reason not to make it per-objectStore/index? As far as I can tell, it should actually be fairly light weight form an API point of view: we can just add it as an optional parameter to createObjectStore/createIndex. 
From an implementation point of view, I really don't see this being much overhead either. So maybe we should just do it? I don't feel very strongly. Though I'd want to check that this is actually pretty easy to do implementation wise. Given that I think this is a low-value feature, I'd want to make sure it's low-cost too. The alternative is to add a function within setVersion to set the language which actually seems less elegant. I don't understand what you mean by this. / Jonas
Re: [IndexedDB] Languages for collation
On Fri, Aug 13, 2010 at 1:31 AM, Pablo Castro pablo.cas...@microsoft.com wrote: From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy Orlow Sent: Thursday, August 12, 2010 2:18 AM I think we should first break down the use cases and look at how many of them just need _a_ sort order, how many of them a per-database sort order is ok, and how many of them would need something finer grained (like a per-key ordering). That's reasonable. What I was thinking is that any case where you'll use the order of items in a store/index to display things to the user (e.g. a list of contacts) you'd want the items to be in proper order for the user's language. That will not only match users' expectations but also match other applications (or even other parts of the UA) that display data based on the current OS language or the users' choice of language. That covers a very broad spectrum of scenarios that need language-specific sort order. I find it unlikely that a single web app will need more than one language per database (or even per origin/OS account), given that most applications operate in a single language at any one point in time. A lot of people are multi-lingual and I'm sure there will be at least some apps that need different data sorted in different ways for each language used. It's quite likely that such apps could use multiple databases as a work-around though. (As long as they don't need to execute transactions between them.) Are there work-arounds for getting a UCA-ordered data structure to hold data in another language's order? For example, I could imagine it'd be possible to do some sort of encode step on the data before insertion (and decode on removal) that would make UCA work. I have no idea, but if such algorithms existed and were well understood, then it'd definitely make me lean towards punting language specification to v2. I'm not sure I understand this paragraph.
UCA ordered may not mean much more than just ordering using a binary collation if the language is not specified. While this is typically not an issue in English, in other languages this introduces a varying level of deviation from users' expectations. Given that different languages have conflicting rules for collation, I'm not sure how this can be generalized independently of the language. Even in the UCA specification [1] the aspect of input language is mentioned as the most important feature of collation. I understand that. What I was asking is whether there are hacks to make it work anyway. i.e. ways to encode/decode the data going in/out. In other words, what's stored as the key would not be exactly the word you put in, but you'd know how to undo the process on the way out. After thinking about it for a couple minutes, I've got some ideas on how to do it, but they're not terribly lightweight. Btw, my intuition is also that a database level control is the right way to go here, but I just want to make sure we've properly considered the pros and cons of the other possibilities. J
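One possible reading of the encode/decode idea above is to prefix each stored key with a language-specific sort key, so that a store using plain binary ordering still yields the desired order, and to strip the prefix on the way out. The sketch below is an assumption about what such a hack could look like, not something proposed in the thread; `encodeKey`, `decodeKey`, and the toy `TRAD_ES` alphabet (with "ch" treated as one letter) are made up for illustration.

```typescript
// Toy alphabet in which "ch" is a single letter sorting after "c".
const TRAD_ES: readonly string[] =
  "a b c ch d e f g h i j k l m n o p q r s t u v w x y z".split(" ");

function encodeKey(word: string, alphabet: readonly string[]): string {
  const lower = word.toLowerCase();
  let prefix = "";
  let i = 0;
  while (i < lower.length) {
    // Prefer a two-letter match so "ch" becomes one collation element.
    const pairIndex = i + 1 < lower.length ? alphabet.indexOf(lower.slice(i, i + 2)) : -1;
    const index = pairIndex !== -1 ? pairIndex : alphabet.indexOf(lower.charAt(i));
    // Offset by 2: code 0 is reserved as the separator, unknown letters map to 1.
    prefix += String.fromCharCode(index + 2);
    i += pairIndex !== -1 ? 2 : 1;
  }
  return prefix + "\u0000" + word;
}

function decodeKey(stored: string): string {
  // The prefix never contains code 0, so the first one is the separator.
  return stored.slice(stored.indexOf("\u0000") + 1);
}

const stored = ["cuando", "chau", "casa"].map((w) => encodeKey(w, TRAD_ES));
stored.sort(); // plain code-unit ordering, as a binary-collated store would use
console.log(stored.map(decodeKey)); // [ 'casa', 'cuando', 'chau' ]
```

The cost is visible even in this toy: every key roughly doubles in size, and range queries would have to encode their bounds the same way.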
Re: [IndexedDB] Languages for collation
On Fri, Aug 13, 2010 at 4:56 AM, Jeremy Orlow jor...@chromium.org wrote: On Fri, Aug 13, 2010 at 1:31 AM, Pablo Castro pablo.cas...@microsoft.com wrote: From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy Orlow Sent: Thursday, August 12, 2010 2:18 AM I think we should first break down the use cases and look at how many of them just need _a_ sort order, how many of them a per-database sort order is ok, and how many of them would need something finer grained (like a per-key ordering). That's reasonable. What I was thinking is that any case where you'll use the order of items in a store/index to display things to the user (e.g. a list of contacts) you'd want the items to be in proper order for the user's language. That will not only match users' expectations but also match other applications (or even other parts of the UA) that display data based on the current OS language or the users' choice of language. That covers a very broad spectrum of scenarios that need language-specific sort order. I find it unlikely that a single web app will need more than one language per database (or even per origin/OS account), given that most applications operate in a single language at any one point in time. A lot of people are multi-lingual and I'm sure there will be at least some apps that need different data sorted in different ways for each language used. It's quite likely that such apps could use multiple databases as a work-around though. (As long as they don't need to execute transactions between them.) I can give some input as a multi-lingual person here. The only time I've used multiple languages at the same time in an application is for spell checking. In my browser I sometimes end up with setting the language in one textbox to swedish, and another to english. It's often annoying how poorly this use case is supported in applications actually. However I've never been in a situation where I've wanted some lists sorted in swedish and some in english. 
Possibly you would want to have spelling suggestions for a Swedish textbox sorted in Swedish order, and spelling suggestions for an English textbox sorted in English order. Though I think it wouldn't be much problem to have the different dictionaries in different databases.

From an API point of view I think it would be pretty easy to support setting collation for individual objectStores. All we'd need is something like:

    interface IDBObjectStore {
      ...
      IDBRequest setSortingLanguage(in DOMString languageCode);
      IDBRequest getSortingLanguage();
      ...
    };

To call setSortingLanguage you'd need READ_WRITE access. It acts just like any other writing request, with the only difference that it can take a long time to execute. We could even add these functions to IDBIndex to allow the same data to be sorted in different ways at the same time.

However I think it's very rare that this will be needed. And there are ways to somewhat work around it by using separate databases. So I would probably say that let's keep it database-wide for now, and reconsider in version 2.

/ Jonas
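The remark about letting the same data be sorted in different ways at the same time can be pictured with a small sketch. `ToyIndex` and `buildIndex` are hypothetical names, the language is fixed when the index is built rather than changed through a request, and `Intl.Collator` is used as a stand-in for the implementation's collation; none of this is the proposed IDL.

```typescript
interface ToyIndex {
  language: string | null; // null means plain binary (code-unit) ordering
  keys: string[];
}

// Build one ordered view over the same underlying keys.
function buildIndex(data: readonly string[], language: string | null): ToyIndex {
  const keys = data.slice();
  if (language === null) {
    keys.sort(); // default sort compares UTF-16 code units
  } else {
    keys.sort(new Intl.Collator(language).compare);
  }
  return { language, keys };
}

const names = ["Zebra", "apple"];
const binaryIndex = buildIndex(names, null);
const englishIndex = buildIndex(names, "en");
console.log(binaryIndex.keys);  // [ 'Zebra', 'apple' ]
console.log(englishIndex.keys); // [ 'apple', 'Zebra' ]
```

Each extra ordering is a full copy of the key list in this toy, which hints at why per-index collation is not free for an implementation.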
Re: [IndexedDB] Languages for collation
On Fri, Aug 13, 2010 at 5:02 PM, Jonas Sicking jo...@sicking.cc wrote: On Fri, Aug 13, 2010 at 4:56 AM, Jeremy Orlow jor...@chromium.org wrote: On Fri, Aug 13, 2010 at 1:31 AM, Pablo Castro pablo.cas...@microsoft.com wrote: From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy Orlow Sent: Thursday, August 12, 2010 2:18 AM I think we should first break down the use cases and look at how many of them just need _a_ sort order, how many of them a per-database sort order is ok, and how many of them would need something finer grained (like a per-key ordering). That's reasonable. What I was thinking is that any case where you'll use the order of items in a store/index to display things to the user (e.g. a list of contacts) you'd want the items to be in proper order for the user's language. That will not only match users' expectations but also match other applications (or even other parts of the UA) that display data based on the current OS language or the users' choice of language. That covers a very broad spectrum of scenarios that need language-specific sort order. I find it unlikely that a single web app will need more than one language per database (or even per origin/OS account), given that most applications operate in a single language at any one point in time. A lot of people are multi-lingual and I'm sure there will be at least some apps that need different data sorted in different ways for each language used. It's quite likely that such apps could use multiple databases as a work-around though. (As long as they don't need to execute transactions between them.) I can give some input as a multi-lingual person here. The only time I've used multiple languages at the same time in an application is for spell checking. In my browser I sometimes end up with setting the language in one textbox to swedish, and another to english. It's often annoying how poorly this use case is supported in applications actually. 
However I've never been in a situation where I've wanted some lists sorted in Swedish and some in English. Possibly you would want to have spelling suggestions for a Swedish textbox sorted in Swedish order, and spelling suggestions for an English textbox sorted in English order. Though I think it wouldn't be much problem to have the different dictionaries in different databases.

From an API point of view I think it would be pretty easy to support setting collation for individual objectStores. All we'd need is something like:

    interface IDBObjectStore {
      ...
      IDBRequest setSortingLanguage(in DOMString languageCode);
      IDBRequest getSortingLanguage();
      ...
    };

To call setSortingLanguage you'd need READ_WRITE access. It acts just like any other writing request, with the only difference that it can take a long time to execute. We could even add these functions to IDBIndex to allow the same data to be sorted in different ways at the same time.

Why not put it behind setVersion and just make it an optional parameter when creating objectStores and indexes? I agree with Pablo that these things really shouldn't be changing much--in fact, maybe it's not worth making them modifiable at all (without rebuilding a new objectStore/index yourself).

However I think it's very rare that this will be needed. And there are ways to somewhat work around it by using separate databases. So I would probably say that let's keep it database-wide for now, and reconsider in version 2.

On the other hand, is there any reason not to make it per-objectStore/index? As far as I can tell, it should actually be fairly lightweight from an API point of view: we can just add it as an optional parameter to createObjectStore/createIndex. From an implementation point of view, I really don't see this being much overhead either. So maybe we should just do it? The alternative is to add a function within setVersion to set the language, which actually seems less elegant.

J
Re: [IndexedDB] Languages for collation
Why not just use the Unicode collation algorithm? Then you won't have to hint the locale.

http://en.wikipedia.org/wiki/Unicode_collation_algorithm

CouchDB uses some definitions around sorting complex types like arrays and objects but when it comes down to sorting strings it just defaults to the Unicode collation algorithm and all the locales are happy.

-Mikeal

On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro pablo.cas...@microsoft.com wrote:

We had some discussions about collation algorithms and such in the past, but I don't think we have settled on the language aspect of it. In order to have stores and indexes sort character-based keys in a way that is consistent with users' expectations we'll have to take indication in the API of what language we should use to collate strings.

Trying to take a minimalist approach, we could add an optional parameter on the database open call that indicates the language to use (e.g. en or en-GB, etc.). If the language is not specified and the database does not exist, then we can use the current browser/OS language to create the database. If not specified and the database already exists, then use the one that's already there (this accommodates the fact that a user may be able to change their default language in the browser/OS after the database has been created using the default). If the language is specified and the database already exists and the specified language is not the one the database has then we'll throw an exception (same behavior as with description, although we have that one in flight right now as well).

We should probably also add a read-only attribute to the database object that exposes the language.

If this works for folks I can write a proposal for the specific changes to the spec.

Thanks
-pablo
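The open-call rules in the quoted proposal can be summarized as a small decision function. This is only a sketch: `resolveDatabaseLanguage` and its parameters are hypothetical names, the case where a language is specified for a database that does not exist yet is filled in with the obvious choice (use the specified language), and tags are compared case-insensitively as an assumption.

```typescript
function resolveDatabaseLanguage(
  requested: string | undefined,   // language passed to open(), if any
  existing: string | undefined,    // language of the existing database, if it exists
  userAgentDefault: string         // current browser/OS language
): string {
  if (existing === undefined) {
    // New database: use the requested language, else the browser/OS default.
    return requested !== undefined ? requested : userAgentDefault;
  }
  if (requested !== undefined && requested.toLowerCase() !== existing.toLowerCase()) {
    // Existing database opened with a different language: error.
    throw new Error(`database language is "${existing}", not "${requested}"`);
  }
  // Existing database, no language given (or the same one): keep what is there,
  // even if the browser/OS default has changed since creation.
  return existing;
}

console.log(resolveDatabaseLanguage(undefined, undefined, "en-US")); // en-US
console.log(resolveDatabaseLanguage(undefined, "sv", "en-US"));      // sv
console.log(resolveDatabaseLanguage("es", undefined, "en-US"));      // es
```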
RE: [IndexedDB] Languages for collation
From: Mikeal Rogers [mailto:mikeal.rog...@gmail.com]
Sent: Wednesday, August 11, 2010 11:35 PM

Why not just use the Unicode collation algorithm? Then you won't have to hint the locale.

Unless I'm missing something, the UCA defines the general algorithm for collating strings but you still need to know the language in order to sort strings properly in that language. For example, in Spanish the letters c and h together (e.g. in chau (bye)) sort as a single letter, causing the expected sort order to be different from English where they are always two independent letters (e.g. chau comes before cuando (when) when sorted in English, but after when sorted in Spanish).

http://en.wikipedia.org/wiki/Unicode_collation_algorithm

CouchDB uses some definitions around sorting complex types like arrays and objects but when it comes down to sorting strings it just defaults to the Unicode collation algorithm and all the locales are happy.

-Mikeal

On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro pablo.cas...@microsoft.com wrote:

We had some discussions about collation algorithms and such in the past, but I don't think we have settled on the language aspect of it. In order to have stores and indexes sort character-based keys in a way that is consistent with users' expectations we'll have to take indication in the API of what language we should use to collate strings.

Trying to take a minimalist approach, we could add an optional parameter on the database open call that indicates the language to use (e.g. en or en-GB, etc.). If the language is not specified and the database does not exist, then we can use the current browser/OS language to create the database. If not specified and the database already exists, then use the one that's already there (this accommodates the fact that a user may be able to change their default language in the browser/OS after the database has been created using the default).
If the language is specified and the database already exists and the specified language is not the one the database has then we'll throw an exception (same behavior as with description, although we have that one in flight right now as well). We should probably also add a read-only attribute to the database object that exposes the language. If this works for folks I can write a proposal for the specific changes to the spec. Thanks -pablo
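The chau/cuando example in the reply above can be reproduced with a toy comparator. This is a deliberately tiny model and not UCA or CLDR: the two alphabets, `sortKey`, `compareKeys`, and `sortWith` are all invented for illustration, and the "Spanish" alphabet reflects the traditional ordering in which "ch" counts as a single letter after "c".

```typescript
const ENGLISH: readonly string[] = "abcdefghijklmnopqrstuvwxyz".split("");
// Same letters, with the digraph "ch" inserted as its own letter after "c".
const TRAD_SPANISH: readonly string[] = ENGLISH.slice(0, 3).concat(["ch"], ENGLISH.slice(3));

// Map a word to the positions of its letters in the given alphabet.
function sortKey(word: string, alphabet: readonly string[]): number[] {
  const lower = word.toLowerCase();
  const key: number[] = [];
  let i = 0;
  while (i < lower.length) {
    const pair = lower.slice(i, i + 2);
    const pairIndex = pair.length === 2 ? alphabet.indexOf(pair) : -1;
    if (pairIndex !== -1) {
      key.push(pairIndex); // two characters treated as one letter
      i += 2;
    } else {
      key.push(alphabet.indexOf(lower.charAt(i))); // -1 for unknown characters
      i += 1;
    }
  }
  return key;
}

function compareKeys(a: number[], b: number[]): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return a.length - b.length;
}

function sortWith(words: readonly string[], alphabet: readonly string[]): string[] {
  return words.slice().sort((x, y) => compareKeys(sortKey(x, alphabet), sortKey(y, alphabet)));
}

console.log(sortWith(["cuando", "chau"], ENGLISH));      // [ 'chau', 'cuando' ]
console.log(sortWith(["cuando", "chau"], TRAD_SPANISH)); // [ 'cuando', 'chau' ]
```

The same two strings come out in opposite orders depending only on the language, which is why the algorithm alone is not enough without a language hint.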
Re: [IndexedDB] Languages for collation
On Thu, Aug 12, 2010 at 8:28 AM, Pablo Castro pablo.cas...@microsoft.com wrote:

From: Mikeal Rogers [mailto:mikeal.rog...@gmail.com]
Sent: Wednesday, August 11, 2010 11:35 PM

Why not just use the Unicode collation algorithm? Then you won't have to hint the locale.

Unless I'm missing something, the UCA defines the general algorithm for collating strings but you still need to know the language in order to sort strings properly in that language. For example, in Spanish the letters c and h together (e.g. in chau (bye)) sort as a single letter, causing the expected sort order to be different from English where they are always two independent letters (e.g. chau comes before cuando (when) when sorted in English, but after when sorted in Spanish).

http://en.wikipedia.org/wiki/Unicode_collation_algorithm

CouchDB uses some definitions around sorting complex types like arrays and objects but when it comes down to sorting strings it just defaults to the Unicode collation algorithm and all the locales are happy.

-Mikeal

On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro pablo.cas...@microsoft.com wrote:

We had some discussions about collation algorithms and such in the past, but I don't think we have settled on the language aspect of it. In order to have stores and indexes sort character-based keys in a way that is consistent with users' expectations we'll have to take indication in the API of what language we should use to collate strings.

Trying to take a minimalist approach, we could add an optional parameter on the database open call that indicates the language to use (e.g. en or en-GB, etc.). If the language is not specified and the database does not exist, then we can use the current browser/OS language to create the database. If not specified and the database already exists, then use the one that's already there (this accommodates the fact that a user may be able to change their default language in the browser/OS after the database has been created using the default).
If the language is specified and the database already exists and the specified language is not the one the database has then we'll throw an exception (same behavior as with description, although we have that one in flight right now as well). We should probably also add a read-only attribute to the database object that exposes the language. I think we should first break down the use cases and look at how many of them just need _a_ sort order, for how many of them a per-database sort order is ok, and how many of them would need something finer grained (like a per-key ordering). Are there work-arounds for getting a UCA-ordered data structure to hold data in another language's order? For example, I could imagine it'd be possible to do some sort of encode step on the data before insertion (and decode on removal) that would make UCA work. I have no idea, but if such algorithms existed and were well understood, then it'd definitely make me lean towards punting language specification to v2. J If this works for folks I can write a proposal for the specific changes to the spec. Thanks -pablo
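The open-call rules proposed above can be sketched as a small decision function. This is only an illustration of the proposal as worded in the thread: the names resolveDatabaseLanguage and OpenResult are hypothetical and not part of any IndexedDB draft, and comparing language tags as exact strings is a simplifying assumption.

```typescript
// Hypothetical sketch of the proposed open() language rules.
// None of these names come from the IndexedDB spec.
type OpenResult =
  | { ok: true; language: string }
  | { ok: false; error: string };

function resolveDatabaseLanguage(
  requested: string | undefined, // optional language passed to open()
  stored: string | undefined,    // language of the existing database, if it exists
  platformDefault: string        // current browser/OS language
): OpenResult {
  if (stored === undefined) {
    // Database does not exist: use the requested language, else the platform default.
    const language = requested !== undefined ? requested : platformDefault;
    return { ok: true, language: language };
  }
  if (requested === undefined || requested === stored) {
    // Database exists and no conflicting language was requested: keep what is stored.
    return { ok: true, language: stored };
  }
  // Database exists with a different language: the proposal says to throw;
  // this sketch reports the mismatch as an error value instead.
  return {
    ok: false,
    error: 'database uses "' + stored + '", not "' + requested + '"',
  };
}
```

Returning an error value rather than throwing keeps the sketch easy to exercise; a real API would presumably raise an exception as the proposal describes.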
Re: [IndexedDB] Languages for collation
On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro pablo.cas...@microsoft.com wrote: We had some discussions about collation algorithms and such in the past, but I don't think we have settled on the language aspect of it. In order to have stores and indexes sort character-based keys in a way that is consistent with users' expectations we'll have to take an indication in the API of what language we should use to collate strings. Trying to take a minimalist approach, we could add an optional parameter on the database open call that indicates the language to use (e.g. en or en-UK, etc.). If the language is not specified and the database does not exist, then we can use the current browser/OS language to create the database. If not specified and the database already exists, then use the one that's already there (this accommodates the fact that a user may be able to change their default language in the browser/OS after the database has been created using the default). If the language is specified and the database already exists and the specified language is not the one the database has then we'll throw an exception (same behavior as with description, although we have that one in flight right now as well). We should probably also add a read-only attribute to the database object that exposes the language. If this works for folks I can write a proposal for the specific changes to the spec. If we make it part of the database open call, then that makes it impossible to change the sorting order of an existing database, no? This seems like it could be a problem. I.e. it is quite possible that an application will want to allow the user to change the sorting language, for example when changing the language of the UI. One solution would be to allow language to be set as part of the setVersion call. / Jonas
Re: [IndexedDB] Languages for collation
On Thu, Aug 12, 2010 at 11:19 AM, Jonas Sicking jo...@sicking.cc wrote: On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro pablo.cas...@microsoft.com wrote: We had some discussions about collation algorithms and such in the past, but I don't think we have settled on the language aspect of it. In order to have stores and indexes sort character-based keys in a way that is consistent with users' expectations we'll have to take an indication in the API of what language we should use to collate strings. Trying to take a minimalist approach, we could add an optional parameter on the database open call that indicates the language to use (e.g. en or en-UK, etc.). If the language is not specified and the database does not exist, then we can use the current browser/OS language to create the database. If not specified and the database already exists, then use the one that's already there (this accommodates the fact that a user may be able to change their default language in the browser/OS after the database has been created using the default). If the language is specified and the database already exists and the specified language is not the one the database has then we'll throw an exception (same behavior as with description, although we have that one in flight right now as well). We should probably also add a read-only attribute to the database object that exposes the language. If this works for folks I can write a proposal for the specific changes to the spec. If we make it part of the database open call, then that makes it impossible to change the sorting order of an existing database, no? This seems like it could be a problem. I.e. it is quite possible that an application will want to allow the user to change the sorting language, for example when changing the language of the UI. One solution would be to allow language to be set as part of the setVersion call. Whether it's per-database or more fine grained I think it absolutely must be part of setVersion. 
Changing the language will be a very heavyweight operation that'll require a similar level of isolation to schema changes of the database. (Not sure how I missed this point of Pablo's original email.) J
RE: [IndexedDB] Languages for collation
From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy Orlow Sent: Thursday, August 12, 2010 2:18 AM I think we should first break down the use cases and look at how many of them just need _a_ sort order, for how many of them a per-database sort order is ok, and how many of them would need something finer grained (like a per-key ordering). That's reasonable. What I was thinking is that any case where you'll use the order of items in a store/index to display things to the user (e.g. a list of contacts) you'd want the items to be in proper order for the user's language. That will not only match users' expectations but also match other applications (or even other parts of the UA) that display data based on the current OS language or the user's choice of language. That covers a very broad spectrum of scenarios that need language-specific sort order. I find it unlikely that a single web app will need more than one language per database (or even per origin/OS account), given that most applications operate in a single language at any one point in time. Are there work-arounds for getting a UCA-ordered data structure to hold data in another language's order? For example, I could imagine it'd be possible to do some sort of encode step on the data before insertion (and decode on removal) that would make UCA work. I have no idea, but if such algorithms existed and were well understood, then it'd definitely make me lean towards punting language specification to v2. I'm not sure I understand this paragraph. UCA-ordered may not mean much more than just ordering using a binary collation if the language is not specified. While this is typically not an issue in English, in other languages this introduces a varying level of deviation from users' expectations. Given that different languages have conflicting rules for collation, I'm not sure how this can be generalized independently of the language. 
Even in the UCA specification [1] the aspect of input language is mentioned as the most important feature of collation. [1] http://www.unicode.org/reports/tr10/
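To make the "conflicting rules" point concrete, here is a toy sketch of the chau/cuando example from earlier in the thread. It is not the UCA and not any real collation library: the Tailoring type and the sortKey, compareKeys and sortWords helpers are hypothetical, and the only tailoring modeled is the traditional Spanish treatment of "ch" as a single letter that sorts after "c".

```typescript
// Toy illustration only: real collation is far more involved than this.
type Tailoring = "english" | "spanish-traditional";

// Turn a word into a list of numeric weights, one per "letter".
function sortKey(word: string, tailoring: Tailoring): number[] {
  const lower = word.toLowerCase();
  const key: number[] = [];
  let i = 0;
  while (i < lower.length) {
    if (tailoring === "spanish-traditional" && lower.slice(i, i + 2) === "ch") {
      // Treat "ch" as one letter placed between "c" and "d".
      key.push("c".charCodeAt(0) + 0.5);
      i += 2;
    } else {
      key.push(lower.charCodeAt(i));
      i += 1;
    }
  }
  return key;
}

// Lexicographic comparison of two weight lists.
function compareKeys(a: number[], b: number[]): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return a.length - b.length;
}

// Return a sorted copy of the words under the given tailoring.
function sortWords(words: string[], tailoring: Tailoring): string[] {
  return words
    .slice()
    .sort((x, y) => compareKeys(sortKey(x, tailoring), sortKey(y, tailoring)));
}

const words = ["cuando", "chau", "dos"];
console.log(sortWords(words, "english"));             // chau, cuando, dos
console.log(sortWords(words, "spanish-traditional")); // cuando, chau, dos
```

The same three keys come out in two different orders, so a store or index cannot be sorted correctly for both audiences without knowing which rules to apply.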
RE: [IndexedDB] Languages for collation
From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy Orlow Sent: Thursday, August 12, 2010 3:36 AM On Thu, Aug 12, 2010 at 11:19 AM, Jonas Sicking jo...@sicking.cc wrote: On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro pablo.cas...@microsoft.com wrote: We had some discussions about collation algorithms and such in the past, but I don't think we have settled on the language aspect of it. In order to have stores and indexes sort character-based keys in a way that is consistent with users' expectations we'll have to take an indication in the API of what language we should use to collate strings. Trying to take a minimalist approach, we could add an optional parameter on the database open call that indicates the language to use (e.g. en or en-UK, etc.). If the language is not specified and the database does not exist, then we can use the current browser/OS language to create the database. If not specified and the database already exists, then use the one that's already there (this accommodates the fact that a user may be able to change their default language in the browser/OS after the database has been created using the default). If the language is specified and the database already exists and the specified language is not the one the database has then we'll throw an exception (same behavior as with description, although we have that one in flight right now as well). We should probably also add a read-only attribute to the database object that exposes the language. If this works for folks I can write a proposal for the specific changes to the spec. If we make it part of the database open call, then that makes it impossible to change the sorting order of an existing database, no? This seems like it could be a problem. I.e. it is quite possible that an application will want to allow the user to change the sorting language, for example when changing the language of the UI. One solution would be to allow language to be set as part of the setVersion call. 
Whether it's per-database or more fine grained I think it absolutely must be part of setVersion. Changing the language will be a very heavyweight operation that'll require a similar level of isolation to schema changes of the database. (Not sure how I missed this point of Pablo's original email.) Yes, changing the collation would effectively mean re-creating all the stores and indexes. At a very minimum it needs to be a setVersion thing. I also don't think it would be too crazy to not support changing collations, period. In the unusual case where a user must absolutely do this, it can be done by creating a separate database and copying the data over using the APIs.
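A rough in-memory sketch of why changing collation amounts to re-creating the store: every key has to be re-inserted under the new comparison function. The SortedStore class and both comparators below are hypothetical stand-ins for illustration and do not use the IndexedDB API.

```typescript
// Hypothetical in-memory stand-in for an object store kept in key order.
type Comparator = (a: string, b: string) => number;

class SortedStore {
  private entries: Array<{ key: string; value: string }> = [];

  constructor(private readonly compare: Comparator) {}

  // Insert or overwrite a record, keeping entries sorted by this store's comparator.
  put(key: string, value: string): void {
    let lo = 0;
    let hi = this.entries.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.compare(this.entries[mid].key, key) < 0) lo = mid + 1;
      else hi = mid;
    }
    if (lo < this.entries.length && this.compare(this.entries[lo].key, key) === 0) {
      this.entries[lo] = { key: key, value: value };
    } else {
      this.entries.splice(lo, 0, { key: key, value: value });
    }
  }

  keys(): string[] {
    return this.entries.map((e) => e.key);
  }

  // Copy every record into another store; returns how many records were touched.
  copyTo(target: SortedStore): number {
    let copied = 0;
    for (const e of this.entries) {
      target.put(e.key, e.value);
      copied++;
    }
    return copied;
  }
}

// Plain code-unit ordering (a "binary" collation).
const binaryOrder: Comparator = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// Case-insensitive ordering, with code-unit order as a tie-breaker.
const caseInsensitiveOrder: Comparator = (a, b) => {
  const r = binaryOrder(a.toLowerCase(), b.toLowerCase());
  return r !== 0 ? r : binaryOrder(a, b);
};

const oldStore = new SortedStore(binaryOrder);
oldStore.put("Zoe", "1");
oldStore.put("adam", "2");
oldStore.put("Bea", "3");

const newStore = new SortedStore(caseInsensitiveOrder);
const touched = oldStore.copyTo(newStore);

console.log(oldStore.keys()); // Bea, Zoe, adam
console.log(newStore.keys()); // adam, Bea, Zoe
console.log(touched);         // 3
```

The copy touches every record, which is the cost being weighed in the thread, whether the work happens inside setVersion or in application code that copies the data into a second database.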