RE: [IndexedDB] Languages for collation

2010-09-29 Thread Pablo Castro

From: Jungshik Shin (신정식, 申政湜) [mailto:jungs...@google.com] 
Sent: Tuesday, August 24, 2010 10:34 PM

 As for the locale identifiers, my understanding is that Windows APIs (newer 
 'name-based' locale APIs) more or less follow BCP 47.


Picking this back up from this August thread. I went around and asked Windows 
folks about this. Locale identifiers based on BCP 47 sound good.

On the other hand, we probably wouldn't do UCA. I heard various worries from
folks who work in this space, including that it still seems to be changing,
so it would be a moving target (which, by the way, means that collisions could
still happen), and that we don't support it in a number of places today. Given
that feedback, I would rather leave this open and let implementations choose
the collation algorithm (they would still need to do language-sensitive
collation, of course). Would that work?

Thanks
-pablo
 


Re: [IndexedDB] Languages for collation

2010-08-17 Thread Jeremy Orlow
On Tue, Aug 17, 2010 at 12:02 AM, Jonas Sicking jo...@sicking.cc wrote:

 On Mon, Aug 16, 2010 at 2:20 AM, Jeremy Orlow jor...@chromium.org wrote:
   However I think it's very rare that this will be needed. And there are
   ways to somewhat work around it by using separate databases. So I
   would probably say let's keep it database-wide for now, and
   reconsider in version 2.
  
   On the other hand, is there any reason not to make it
   per-objectStore/index?
   As far as I can tell, it should actually be fairly lightweight from an
   API point of view: we can just add it as an optional parameter to
   createObjectStore/createIndex.  From an implementation point of view, I
   really don't see this being much overhead either.  So maybe we should
   just do it?
 
  I don't feel very strongly. Though I'd want to check that this is
  actually pretty easy to do implementation wise. Given that I think
  this is a low-value feature, I'd want to make sure it's low-cost too.
 
  How will we check?  And should we really be basing decisions off of what's
  easiest to do implementation wise?  And is this truly a low value feature?

 By check I meant talk to Ben and Shawn, who actually know how our
 implementation works in detail. So the result is that in our current
 architecture we can't support different collations for different
 objectStores.


Come to think of it, it's the same for us.  But that's not to say that it
couldn't be done another way.  And implementation should be a very minor
worry for us.  But given that we think multiple databases will be a good
work around, I'm fine sticking with a per-database setting as Pablo
originally proposed.


 We can support changing collation in an existing
 database though. It will be a very slow operation, but it's needed to
 avoid forcing authors to delete an existing database and recreate a
 new one with a new collation.

 By low value I mean that no one has presented a use case that requires
 it.

   The alternative is to add a function within setVersion to set the
   language, which actually seems less elegant.
 
  I don't understand what you mean by this.
 
  Have a setLanguage method on IDBDatabase that can only be called from
  within a setVersion transaction.  In the same way removeObjectStore and
  company can only be called within a setVersion transaction.

 That would work. So effectively this function would modify all the
 data in all the objectStores and indexes such that it's now sorted
 according to the new collation. The 'success' event is fired after all
 data has been updated. Any requests made after the setLanguage call
 will see the modified data.

 Is that the idea?


I'm not married to any of the particulars, but yeah that is the general
idea.
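
The setLanguage idea discussed above can be sketched with a small in-memory
model. This is only an illustration under stated assumptions: MockDatabase,
its fields, and the synchronous setVersion callback are all invented here and
are not the IndexedDB API; the real proposal would be asynchronous and would
fire a 'success' event once every store and index had been re-sorted.

```javascript
// In-memory stand-in for a database. Nothing here is the real IndexedDB API.
class MockDatabase {
  // comparators: object mapping a language tag to a compare(a, b) function.
  constructor(comparators, language) {
    this.comparators = comparators;
    this.language = language;
    this.stores = new Map(); // store name -> array of string keys
    this.inVersionChange = false;
  }

  // Runs `callback` as if inside a setVersion transaction.
  setVersion(callback) {
    this.inVersionChange = true;
    try {
      callback(this);
    } finally {
      this.inVersionChange = false;
    }
  }

  // Hypothetical setLanguage: only legal during setVersion. It re-sorts every
  // store, so any request made afterwards sees the new order.
  setLanguage(language) {
    if (!this.inVersionChange) {
      throw new Error("setLanguage requires a setVersion transaction");
    }
    const compare = this.comparators[language];
    if (!compare) {
      throw new Error("Unsupported language: " + language);
    }
    for (const keys of this.stores.values()) {
      keys.sort(compare);
    }
    this.language = language;
  }
}

// Usage: the comparators come from Intl.Collator, so the exact order depends
// on the locale data available in the runtime.
const db = new MockDatabase(
  {
    en: new Intl.Collator("en").compare,
    sv: new Intl.Collator("sv").compare,
  },
  "en"
);
db.stores.set("words", ["zebra", "ärta", "apa"]);
db.setVersion((d) => d.setLanguage("sv"));
console.log(db.language, db.stores.get("words"));
```

The re-sort touches every key in every store, which is why the thread expects
this to be a very slow operation on a real database.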


Re: [IndexedDB] Languages for collation

2010-08-17 Thread Jeremy Orlow
On Tue, Aug 17, 2010 at 12:37 AM, Jungshik Shin (신정식, 申政湜) 
jungs...@google.com wrote:

 + adding the authors of BCP 47 (Mark Davis and Addison Phillips) and
 Richard Ishida (w3c i18n)

 On Mon, Aug 16, 2010 at 4:03 PM, Jonas Sicking jo...@sicking.cc wrote:

 On Mon, Aug 16, 2010 at 10:11 AM, Jeremy Orlow jor...@chromium.org
 wrote:
  2 additional questions:  What standard will define the language codes and
  the associated collation algorithm?

 Very good questions. Are there specifications for this stuff elsewhere?


 As for the language code, we already have BCP 47. See

 http://www.rfc-editor.org/rfc/bcp/bcp47.txt

 The Registry
 http://www.iana.org/assignments/language-subtag-registry

 http://unicode.org/reports/tr35/#BCP47

 The collation algorithm should be based on UCA (
 http://unicode.org/reports/tr10/ ) with locale-specific tailoring coming
 from CLDR (http://cldr.unicode.org )


   And what's the behavior for an
  implementation that doesn't support that particular language?

 http://unicode.org/reports/tr35/#BCP47 BCP 47 above defines a
 truncation/fallback mechanism. If all the locales along the line of
 truncation/fallback fail, it would eventually fall back to the UCA.
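
The truncation/fallback mechanism can be sketched roughly as follows. This is
a simplified reading of the lookup-style truncation in RFC 4647 (part of
BCP 47), not a complete implementation: tags are matched case-sensitively,
and the function names, the `supported` set, and the "root" label for the
untailored default order are all assumptions made for illustration.

```javascript
// Build the list of candidates for a language tag by dropping subtags from
// the end. A single-character subtag left dangling at the end (such as "u"
// or "x") is dropped together with the subtag that followed it.
function fallbackChain(tag) {
  const chain = [];
  let subtags = tag.split("-");
  while (subtags.length > 0) {
    chain.push(subtags.join("-"));
    subtags = subtags.slice(0, -1);
    if (subtags.length > 0 && subtags[subtags.length - 1].length === 1) {
      subtags = subtags.slice(0, -1);
    }
  }
  return chain;
}

// Hypothetical resolver: `supported` stands in for whatever set of
// tailorings an implementation ships; "root" stands in for the default,
// untailored ordering that everything eventually falls back to.
function resolveCollation(tag, supported) {
  for (const candidate of fallbackChain(tag)) {
    if (supported.has(candidate)) {
      return candidate;
    }
  }
  return "root";
}

const supported = new Set(["sv", "zh-Hant", "en"]);
console.log(fallbackChain("zh-Hant-TW")); // zh-Hant-TW, zh-Hant, zh
console.log(resolveCollation("zh-Hant-TW", supported)); // zh-Hant
console.log(resolveCollation("es-AR", supported)); // root
```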

 Jungshik


 / Jonas



Thanks for the response, Jungshik!  Referencing this stuff looks good for
the spec side of things.  Do you know anything about the implementation
side, by chance?  In other words are there any standard libraries that we
can use for all of this?  (Ideally BSD, LGPL, or similarly licensed? :-)

J


Re: [IndexedDB] Languages for collation

2010-08-16 Thread Jeremy Orlow
On Mon, Aug 16, 2010 at 12:09 AM, Jonas Sicking jo...@sicking.cc wrote:

 On Fri, Aug 13, 2010 at 12:15 PM, Jeremy Orlow jor...@chromium.org
 wrote:
  On Fri, Aug 13, 2010 at 5:02 PM, Jonas Sicking jo...@sicking.cc wrote:
 
  On Fri, Aug 13, 2010 at 4:56 AM, Jeremy Orlow jor...@chromium.org
 wrote:
   On Fri, Aug 13, 2010 at 1:31 AM, Pablo Castro
   pablo.cas...@microsoft.com
   wrote:
  
   From: jor...@google.com [mailto:jor...@google.com] On Behalf Of
 Jeremy
   Orlow
   Sent: Thursday, August 12, 2010 2:18 AM
  
    I think we should first break down the use cases and look at how many
    of them just need _a_ sort order, how many of them a per-database sort
    order is ok, and how many of them would need something finer grained
    (like a per-key ordering).
  
   That's reasonable. What I was thinking is that any case where you'll use
   the order of items in a store/index to display things to the user (e.g. a
   list of contacts) you'd want the items to be in proper order for the
   user's language. That will not only match users' expectations but also
   match other applications (or even other parts of the UA) that display
   data based on the current OS language or the users' choice of language.
  
   That covers a very broad spectrum of scenarios that need
   language-specific sort order.
  
   I find it unlikely that a single web app will need more than one language
   per database (or even per origin/OS account), given that most
   applications operate in a single language at any one point in time.
  
   A lot of people are multi-lingual and I'm sure there will be at least
   some apps that need different data sorted in different ways for each
   language used.  It's quite likely that such apps could use multiple
   databases as a work-around though.  (As long as they don't need to
   execute transactions between them.)
 
  I can give some input as a multi-lingual person here. The only time
  I've used multiple languages at the same time in an application is for
  spell checking. In my browser I sometimes end up with setting the
  language in one textbox to swedish, and another to english. It's often
  annoying how poorly this use case is supported in applications
  actually.
 
  However I've never been in a situation where I've wanted some lists
  sorted in swedish and some in english. Possibly you would want to have
  spelling suggestions for a swedish textbox sorted in swedish order,
  and spelling suggestions for an english textbox sorted in english
  order. Though I think it wouldn't be much problem to have the
  different dictionaries in different databases.
 
  From an API point of view I think it would be pretty easy to support
  setting collation for individual objectStores. All we'd need is
  something like:
 
  interface IDBObjectStore {
   ...
   IDBRequest setSortingLanguage(in DOMString languageCode);
   IDBRequest getSortingLanguage();
   ...
  };
 
  To call setSortingLanguage you'd need READ_WRITE access. It acts just
  like any other writing request, with the only difference that it can
  take a long time to execute. We could even add these functions to
  IDBIndex to allow the same data to be sorted in different ways at the
  same time.
 
  Why not put it behind setVersion and just make it an optional parameter
  when creating objectStores and indexes?  I agree with Pablo that these
  things really shouldn't be changing much--in fact, maybe it's not worth
  making them modifiable at all (without rebuilding a new objectStore/index
  yourself).

 What is the advantage of this approach? It seems more cumbersome for
 authors. It brings back memories of the days when you had to recreate
 a SQL table to add a column to it.


The advantage is that the API is clearer, both syntactically and in terms of
its performance impact.

If you felt strongly, we could add a modifyObjectStore/modifyIndex method,
but I don't think it's necessary.


  However I think it's very rare that this will be needed. And there are
  ways to somewhat work around it by using separate databases. So I
  would probably say let's keep it database-wide for now, and
  reconsider in version 2.
 
  On the other hand, is there any reason not to make it
  per-objectStore/index?
  As far as I can tell, it should actually be fairly lightweight from an
  API point of view: we can just add it as an optional parameter to
  createObjectStore/createIndex.  From an implementation point of view, I
  really don't see this being much overhead either.  So maybe we should
  just do it?

 I don't feel very strongly. Though I'd want to check that this is
 actually pretty easy to do implementation wise. Given that I think
 this is a low-value feature, I'd want to make sure it's low-cost too.


How will we check?  And should we really be basing decisions off of what's
easiest to do implementation wise?  And is this truly a low value feature?


  The alternative is to add a function 

Re: [IndexedDB] Languages for collation

2010-08-16 Thread Jonas Sicking
On Mon, Aug 16, 2010 at 2:20 AM, Jeremy Orlow jor...@chromium.org wrote:
  However I think it's very rare that this will be needed. And there are
  ways to somewhat work around it by using separate databases. So I
  would probably say let's keep it database-wide for now, and
  reconsider in version 2.
 
  On the other hand, is there any reason not to make it
  per-objectStore/index?
  As far as I can tell, it should actually be fairly lightweight from an
  API point of view: we can just add it as an optional parameter to
  createObjectStore/createIndex.  From an implementation point of view, I
  really don't see this being much overhead either.  So maybe we should
  just do it?

 I don't feel very strongly. Though I'd want to check that this is
 actually pretty easy to do implementation wise. Given that I think
 this is a low-value feature, I'd want to make sure it's low-cost too.

 How will we check?  And should we really be basing decisions off of what's
 easiest to do implementation wise?  And is this truly a low value feature?

By check I meant talk to Ben and Shawn, who actually know how our
implementation works in detail. So the result is that in our current
architecture we can't support different collations for different
objectStores. We can support changing collation in an existing
database though. It will be a very slow operation, but it's needed to
avoid forcing authors to delete an existing database and recreate a
new one with a new collation.

By low value I mean that no one has presented a use case that requires it.

  The alternative is to add a function within setVersion to set the
  language, which actually seems less elegant.

 I don't understand what you mean by this.

 Have a setLanguage method on IDBDatabase that can only be called from within
 a setVersion transaction.  In the same way removeObjectStore and company can
 only be called within a setVersion transaction.

That would work. So effectively this function would modify all the
data in all the objectStores and indexes such that it's now sorted
according to the new collation. The 'success' event is fired after all
data has been updated. Any requests made after the setLanguage call
will see the modified data.

Is that the idea?

/ Jonas



Re: [IndexedDB] Languages for collation

2010-08-16 Thread Jonas Sicking
On Mon, Aug 16, 2010 at 10:11 AM, Jeremy Orlow jor...@chromium.org wrote:
 2 additional questions:  What standard will define the language codes and
 the associated collation algorithm?  And what's the behavior for an
 implementation that doesn't support that particular language?

Very good questions. Are there specifications for this stuff elsewhere?

/ Jonas



Re: [IndexedDB] Languages for collation

2010-08-15 Thread Jonas Sicking
On Fri, Aug 13, 2010 at 12:15 PM, Jeremy Orlow jor...@chromium.org wrote:
 On Fri, Aug 13, 2010 at 5:02 PM, Jonas Sicking jo...@sicking.cc wrote:

 On Fri, Aug 13, 2010 at 4:56 AM, Jeremy Orlow jor...@chromium.org wrote:
  On Fri, Aug 13, 2010 at 1:31 AM, Pablo Castro
  pablo.cas...@microsoft.com
  wrote:
 
  From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy
  Orlow
  Sent: Thursday, August 12, 2010 2:18 AM
 
   I think we should first break down the use cases and look at how many
   of them just need _a_ sort order, how many of them a per-database sort
   order is ok, and how many of them would need something finer grained
   (like a per-key ordering).
 
  That's reasonable. What I was thinking is that any case where you'll use
  the order of items in a store/index to display things to the user (e.g. a
  list of contacts) you'd want the items to be in proper order for the
  user's language. That will not only match users' expectations but also
  match other applications (or even other parts of the UA) that display
  data based on the current OS language or the users' choice of language.
 
  That covers a very broad spectrum of scenarios that need
  language-specific sort order.
 
  I find it unlikely that a single web app will need more than one language
  per database (or even per origin/OS account), given that most
  applications operate in a single language at any one point in time.
 
  A lot of people are multi-lingual and I'm sure there will be at least
  some apps that need different data sorted in different ways for each
  language used.  It's quite likely that such apps could use multiple
  databases as a work-around though.  (As long as they don't need to
  execute transactions between them.)

 I can give some input as a multi-lingual person here. The only time
 I've used multiple languages at the same time in an application is for
 spell checking. In my browser I sometimes end up with setting the
 language in one textbox to swedish, and another to english. It's often
 annoying how poorly this use case is supported in applications
 actually.

 However I've never been in a situation where I've wanted some lists
 sorted in swedish and some in english. Possibly you would want to have
 spelling suggestions for a swedish textbox sorted in swedish order,
 and spelling suggestions for an english textbox sorted in english
 order. Though I think it wouldn't be much problem to have the
 different dictionaries in different databases.

 From an API point of view I think it would be pretty easy to support
 setting collation for individual objectStores. All we'd need is
 something like:

 interface IDBObjectStore {
  ...
  IDBRequest setSortingLanguage(in DOMString languageCode);
  IDBRequest getSortingLanguage();
  ...
 };

 To call setSortingLanguage you'd need READ_WRITE access. It acts just
 like any other writing request, with the only difference that it can
 take a long time to execute. We could even add these functions to
 IDBIndex to allow the same data to be sorted in different ways at the
 same time.

 Why not put it behind setVersion and just make it an optional parameter when
 creating objectStores and indexes?  I agree with Pablo that these things
 really shouldn't be changing much--in fact, maybe it's not worth making
 them modifiable at all (without rebuilding a new objectStore/index
 yourself).

What is the advantage of this approach? It seems more cumbersome for
authors. It brings back memories of the days when you had to recreate
a SQL table to add a column to it.

 However I think it's very rare that this will be needed. And there are
 ways to somewhat work around it by using separate databases. So I
 would probably say let's keep it database-wide for now, and
 reconsider in version 2.

 On the other hand, is there any reason not to make it per-objectStore/index?
 As far as I can tell, it should actually be fairly lightweight from an API
 point of view: we can just add it as an optional parameter to
 createObjectStore/createIndex.  From an implementation point of view, I
 really don't see this being much overhead either.  So maybe we should just
 do it?

I don't feel very strongly. Though I'd want to check that this is
actually pretty easy to do implementation wise. Given that I think
this is a low-value feature, I'd want to make sure it's low-cost too.

 The alternative is to add a function within setVersion to set the language
 which actually seems less elegant.

I don't understand what you mean by this.

/ Jonas



Re: [IndexedDB] Languages for collation

2010-08-13 Thread Jeremy Orlow
On Fri, Aug 13, 2010 at 1:31 AM, Pablo Castro pablo.cas...@microsoft.com wrote:


 From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy
 Orlow
 Sent: Thursday, August 12, 2010 2:18 AM

  I think we should first break down the use cases and look at how many of
 them just need _a_ sort order, how many of them a per-database sort order is
 ok, and how many of them would need something finer grained (like a per-key
 ordering).

 That's reasonable. What I was thinking is that any case where you'll use
 the order of items in a store/index to display things to the user (e.g. a
 list of contacts) you'd want the items to be in proper order  for the user's
 language. That will not only match users' expectations but also match other
 applications (or even other parts of the UA) that display data based on the
 current OS language or the users' choice of language.

 That covers a very broad spectrum of scenarios that need language-specific
 sort order.

 I find it unlikely that a single web app will need more than one language
 per database (or even per origin/OS account), given that most applications
 operate in a single language at any one point in time.


A lot of people are multi-lingual and I'm sure there will be at least some
apps that need different data sorted in different ways for each language
used.  It's quite likely that such apps could use multiple databases as a
work-around though.  (As long as they don't need to execute transactions
between them.)


  Are there work-arounds for getting a UCA-ordered data structure to hold
 data in another language's order?  For example, I could imagine it'd be possible
 to do some sort of encode step on the data before insertion (and decode on
 removal) that would make UCA work.  I have no idea, but if such algorithms
 existed and were well understood, then it'd definitely make me lean towards
 punting language specification to v2.

 I'm not sure I understand this paragraph. UCA ordered may not mean much
 more than just ordering using a binary collation if the language is not
 specified. While this is typically not an issue in English, in other
 languages this introduces a varying level of deviation from users'
 expectations. Given that different languages have conflicting rules for
 collation, I'm not sure how this can be generalized independently of the
 language. Even in the UCA specification [1] the aspect of input language is
 mentioned as the most important feature of collation.


I understand that.  What I was asking is whether there are hacks to make it
work anyway.  i.e. ways to encode/decode the data going in/out.  In other
words, what's stored as the key would not be exactly the word you put in,
but you'd know how to undo the process on the way out.  After thinking about
it for a couple minutes, I've got some ideas on how to do it, but they're
not terribly lightweight.
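
One way to picture the encode/decode idea is to store a language-specific
sort key in front of the original word, so that a plain binary ordering of
the stored keys matches the desired language order. The sketch below is only
an illustration and is not what was proposed for the spec: the function names
are invented, the sort key models a single rule (traditional Spanish treating
"ch" as one letter after "c"), and it assumes the words never contain U+0000
or U+FFFF.

```javascript
// Separator between sort key and original word. U+0000 sorts below every
// letter, so a word still sorts before any longer word that starts with it.
const SEPARATOR = "\u0000";

// Hypothetical sort key: rewrite "ch" as "c" followed by U+FFFF, which
// compares above every letter in code-unit order. The result sorts after
// every other "c..." word and before "d...".
function sortKey(word) {
  return word.toLowerCase().replace(/ch/g, "c\uffff");
}

// What gets stored as the key: sort key first, original word after it.
function encodeKey(word) {
  return sortKey(word) + SEPARATOR + word;
}

// Undo the process on the way out: keep only what follows the separator.
function decodeKey(stored) {
  return stored.slice(stored.indexOf(SEPARATOR) + 1);
}

// Default sort() compares strings by UTF-16 code units, standing in for a
// store that only knows binary ordering.
const storedKeys = ["cuando", "chau", "casa"].map(encodeKey).sort();
console.log(storedKeys.map(decodeKey)); // casa, cuando, chau
```

The cost is visible even in this toy: every key roughly doubles in size and
every read and write pays for the transformation, which fits the remark that
such schemes are not terribly lightweight.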

Btw, my intuition is also that a database level control is the right way to
go here, but I just want to make sure we've properly considered the pros and
cons of the other possibilities.

J


Re: [IndexedDB] Languages for collation

2010-08-13 Thread Jonas Sicking
On Fri, Aug 13, 2010 at 4:56 AM, Jeremy Orlow jor...@chromium.org wrote:
 On Fri, Aug 13, 2010 at 1:31 AM, Pablo Castro pablo.cas...@microsoft.com
 wrote:

 From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy
 Orlow
 Sent: Thursday, August 12, 2010 2:18 AM

  I think we should first break down the use cases and look at how many
  of them just need _a_ sort order, how many of them a per-database sort
  order is ok, and how many of them would need something finer grained
  (like a per-key ordering).

 That's reasonable. What I was thinking is that any case where you'll use
 the order of items in a store/index to display things to the user (e.g. a
 list of contacts) you'd want the items to be in proper order  for the user's
 language. That will not only match users' expectations but also match other
 applications (or even other parts of the UA) that display data based on the
 current OS language or the users' choice of language.

 That covers a very broad spectrum of scenarios that need language-specific
 sort order.

 I find it unlikely that a single web app will need more than one language
 per database (or even per origin/OS account), given that most applications
 operate in a single language at any one point in time.

 A lot of people are multi-lingual and I'm sure there will be at least some
 apps that need different data sorted in different ways for each language
 used.  It's quite likely that such apps could use multiple databases as a
 work-around though.  (As long as they don't need to execute transactions
 between them.)

I can give some input as a multi-lingual person here. The only time
I've used multiple languages at the same time in an application is for
spell checking. In my browser I sometimes end up with setting the
language in one textbox to swedish, and another to english. It's often
annoying how poorly this use case is supported in applications
actually.

However I've never been in a situation where I've wanted some lists
sorted in swedish and some in english. Possibly you would want to have
spelling suggestions for a swedish textbox sorted in swedish order,
and spelling suggestions for an english textbox sorted in english
order. Though I think it wouldn't be much problem to have the
different dictionaries in different databases.

From an API point of view I think it would be pretty easy to support
setting collation for individual objectStores. All we'd need is
something like:

interface IDBObjectStore {
  ...
  IDBRequest setSortingLanguage(in DOMString languageCode);
  IDBRequest getSortingLanguage();
  ...
};

To call setSortingLanguage you'd need READ_WRITE access. It acts just
like any other writing request, with the only difference that it can
take a long time to execute. We could even add these functions to
IDBIndex to allow the same data to be sorted in different ways at the
same time.

However I think it's very rare that this will be needed. And there are
ways to somewhat work around it by using separate databases. So I
would probably say let's keep it database-wide for now, and
reconsider in version 2.

/ Jonas



Re: [IndexedDB] Languages for collation

2010-08-13 Thread Jeremy Orlow
On Fri, Aug 13, 2010 at 5:02 PM, Jonas Sicking jo...@sicking.cc wrote:

 On Fri, Aug 13, 2010 at 4:56 AM, Jeremy Orlow jor...@chromium.org wrote:
  On Fri, Aug 13, 2010 at 1:31 AM, Pablo Castro pablo.cas...@microsoft.com
  wrote:
 
  From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy
  Orlow
  Sent: Thursday, August 12, 2010 2:18 AM
 
   I think we should first break down the use cases and look at how many
   of them just need _a_ sort order, how many of them a per-database sort
   order is ok, and how many of them would need something finer grained
   (like a per-key ordering).
 
  That's reasonable. What I was thinking is that any case where you'll use
  the order of items in a store/index to display things to the user (e.g. a
  list of contacts) you'd want the items to be in proper order for the
  user's language. That will not only match users' expectations but also
  match other applications (or even other parts of the UA) that display
  data based on the current OS language or the users' choice of language.
 
  That covers a very broad spectrum of scenarios that need
  language-specific sort order.
 
  I find it unlikely that a single web app will need more than one language
  per database (or even per origin/OS account), given that most
  applications operate in a single language at any one point in time.
 
  A lot of people are multi-lingual and I'm sure there will be at least
  some apps that need different data sorted in different ways for each
  language used.  It's quite likely that such apps could use multiple
  databases as a work-around though.  (As long as they don't need to
  execute transactions between them.)

 I can give some input as a multi-lingual person here. The only time
 I've used multiple languages at the same time in an application is for
 spell checking. In my browser I sometimes end up with setting the
 language in one textbox to swedish, and another to english. It's often
 annoying how poorly this use case is supported in applications
 actually.

 However I've never been in a situation where I've wanted some lists
 sorted in swedish and some in english. Possibly you would want to have
 spelling suggestions for a swedish textbox sorted in swedish order,
 and spelling suggestions for an english textbox sorted in english
 order. Though I think it wouldn't be much problem to have the
 different dictionaries in different databases.

 From an API point of view I think it would be pretty easy to support
 setting collation for individual objectStores. All we'd need is
 something like:

 interface IDBObjectStore {
  ...
  IDBRequest setSortingLanguage(in DOMString languageCode);
  IDBRequest getSortingLanguage();
  ...
 };

 To call setSortingLanguage you'd need READ_WRITE access. It acts just
 like any other writing request, with the only difference that it can
 take a long time to execute. We could even add these functions to
 IDBIndex to allow the same data to be sorted in different ways at the
 same time.


Why not put it behind setVersion and just make it an optional parameter when
creating objectStores and indexes?  I agree with Pablo that these things
really shouldn't be changing much--in fact, maybe it's not worth making
them modifiable at all (without rebuilding a new objectStore/index
yourself).


 However I think it's very rare that this will be needed. And there are
 ways to somewhat work around it by using separate databases. So I
 would probably say let's keep it database-wide for now, and
 reconsider in version 2.


On the other hand, is there any reason not to make it per-objectStore/index?
 As far as I can tell, it should actually be fairly lightweight from an API
point of view: we can just add it as an optional parameter to
createObjectStore/createIndex.  From an implementation point of view, I
really don't see this being much overhead either.  So maybe we should just
do it?

The alternative is to add a function within setVersion to set the language
which actually seems less elegant.

J


Re: [IndexedDB] Languages for collation

2010-08-12 Thread Mikeal Rogers
Why not just use the unicode collation algorithm?

Then you won't have to hint the locale.

http://en.wikipedia.org/wiki/Unicode_collation_algorithm

CouchDB uses some definitions around sorting complex types like arrays and
objects but when it comes down to sorting strings it just defaults to the
unicode collation algorithm and all the locales are happy.

-Mikeal


On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro
pablo.cas...@microsoft.com wrote:

 We had some discussions about collation algorithms and such in the past,
 but I don't think we have settled on the language aspect of it. In order to
 have stores and indexes sort character-based keys in a way that is
 consistent with users' expectations we'll have to take indication in the API
 of what language we should use to collate strings.

 Trying to take a minimalist approach, we could add an optional parameter on
 the database open call that indicates the language to use (e.g. en or
 en-GB, etc.). If the language is not specified and the database does not
 exist, then we can use the current browser/OS language to create the
 database. If not specified and database already exists, then use the one
 that's already there (this accommodates the fact that a user may be able to
 change their default language in the browser/OS after the database has been
 created using the default). If the language is specified and the database
 already exists and the specified language is not the one the database has
 then we'll throw an exception (same behavior as with description, although
 we have that one in flight right now as well).

 We should probably also add a read-only attribute to the database object
 that exposes the language.

 If this works for folks I can write a proposal for the specific changes to
 the spec.

 Thanks
 -pablo
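
The rules Pablo proposes for the optional language parameter on open can be
summarized as a small decision function. This is a sketch of the proposal as
worded above, not spec text: the function name, its parameters, and the use
of a plain Error are assumptions made for illustration.

```javascript
// Hypothetical model of the proposed rules.
//   requested:       language passed to open, or undefined if omitted
//   existing:        language the database was created with, or undefined
//                    if the database does not exist yet
//   defaultLanguage: current browser/OS language
function resolveDatabaseLanguage(requested, existing, defaultLanguage) {
  if (existing === undefined) {
    // New database: use the requested language, else the browser/OS default.
    return requested !== undefined ? requested : defaultLanguage;
  }
  if (requested === undefined || requested === existing) {
    // Existing database: keep the language it was created with, even if the
    // browser/OS default has changed since then.
    return existing;
  }
  // Existing database, different language requested: the proposal is to
  // throw an exception.
  throw new Error(
    `Database uses "${existing}" but "${requested}" was requested`
  );
}

console.log(resolveDatabaseLanguage(undefined, undefined, "sv")); // sv
console.log(resolveDatabaseLanguage("es", undefined, "sv")); // es
console.log(resolveDatabaseLanguage(undefined, "es", "sv")); // es
```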





RE: [IndexedDB] Languages for collation

2010-08-12 Thread Pablo Castro

From: Mikeal Rogers [mailto:mikeal.rog...@gmail.com] 
Sent: Wednesday, August 11, 2010 11:35 PM

 Why not just use the unicode collation algorithm?

 Then you won't have to hint the locale.

Unless I'm missing something, the UCA defines the general algorithm for
collating strings but you still need to know the language in order to sort
strings properly in that language. For example, in Spanish the letters "c" and
"h" together (e.g. in "chau" (bye)) sort as a single letter, causing the
expected sort order to be different from English, where they are always two
independent letters (e.g. "chau" comes before "cuando" (when) when sorted in
English, but after it when sorted in Spanish).
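
To make the difference concrete, here is a minimal hand-rolled comparison.
It is not a real collation implementation: the helper names are invented,
only lowercase ASCII words are handled, and the only language rule modeled is
the one described above, where "ch" counts as a single letter that sorts
after "c".

```javascript
// Map a word to an array of letter ranks. When treatChAsLetter is true, the
// pair "ch" becomes one letter ranked between "c" and "d".
function letterRanks(word, treatChAsLetter) {
  const ranks = [];
  for (let i = 0; i < word.length; i++) {
    if (treatChAsLetter && word[i] === "c" && word[i + 1] === "h") {
      ranks.push("c".charCodeAt(0) + 0.5);
      i++; // consume the "h" as part of the digraph
    } else {
      ranks.push(word.charCodeAt(i));
    }
  }
  return ranks;
}

// Compare two words rank by rank; on a shared prefix the shorter word wins.
function compareWords(a, b, treatChAsLetter) {
  const ra = letterRanks(a, treatChAsLetter);
  const rb = letterRanks(b, treatChAsLetter);
  const n = Math.min(ra.length, rb.length);
  for (let i = 0; i < n; i++) {
    if (ra[i] !== rb[i]) {
      return ra[i] - rb[i];
    }
  }
  return ra.length - rb.length;
}

const words = ["cuando", "chau", "casa"];
const englishOrder = [...words].sort((a, b) => compareWords(a, b, false));
const spanishOrder = [...words].sort((a, b) => compareWords(a, b, true));
console.log(englishOrder); // casa, chau, cuando
console.log(spanishOrder); // casa, cuando, chau
```

The same three keys end up in a different order depending on the language
rule, which is why the store or index needs to know which language to use.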


 http://en.wikipedia.org/wiki/Unicode_collation_algorithm

 CouchDB uses some definitions around sorting complex types like arrays and 
 objects but when it comes down to sorting strings it just defaults to the
 unicode collation algorithm and all the locales are happy.

 -Mikeal

 On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro pablo.cas...@microsoft.com 
 wrote:
 We had some discussions about collation algorithms and such in the past, but 
 I don't think we have settled on the language aspect of it. In order to have 
 stores and indexes sort character-based keys in a way that is consistent 
 with users' expectations we'll have to take indication in the API of what 
 language we should use to collate strings.

 Trying to take a minimalist approach, we could add an optional parameter on 
 the database open call that indicates the language to use (e.g. en or 
 en-UK, etc.). If the language is not specified and the database does not 
 exist, then we can use the current browser/OS language to create the 
 database. If not specified and database already exists, then use the one 
 that's already there (this accommodates the fact that a user may be able to 
 change their default language in the browser/OS after the database has been 
 created using the default). If the language is specified and the database 
 already exists and the specified language is not the one the database has 
 then we'll throw an exception (same behavior as with description, although 
 we have that one in flight right now as well).

 We should probably also add a read-only attribute to the database object 
 that exposes the language.

 If this works for folks I can write a proposal for the specific changes to 
 the spec.

 Thanks
 -pablo





Re: [IndexedDB] Languages for collation

2010-08-12 Thread Jeremy Orlow
On Thu, Aug 12, 2010 at 8:28 AM, Pablo Castro pablo.cas...@microsoft.com wrote:


 From: Mikeal Rogers [mailto:mikeal.rog...@gmail.com]
 Sent: Wednesday, August 11, 2010 11:35 PM

  Why not just use the unicode collation algorithm?
 
  Then you won't have to hint the locale.

 Unless I'm missing something, the UCA defines the general algorithm for
 collating strings but you still need to know the language in order to sort
 strings properly in that language. For example, in Spanish the letters c
 and h together (e.g. in chau (bye)) sort as a single letter, causing
 the expected sort order to be different from English where they are always
 two independent letters (e.g. so chau comes before cuando (when) when
 sorted in English, but after when sorted in Spanish).

 
  http://en.wikipedia.org/wiki/Unicode_collation_algorithm
 
  CouchDB uses some definitions around sorting complex types like arrays
 and objects but when it comes down to sorting strings it just defaults to
 the Unicode collation algorithm and all the locales are happy.
 
  -Mikeal
 
  On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro 
 pablo.cas...@microsoft.com wrote:
  We had some discussions about collation algorithms and such in the past,
 but I don't think we have settled on the language aspect of it. In order to
 have stores and indexes sort character-based keys in a way that is
 consistent with users' expectations we'll have to take indication in the API
 of what language we should use to collate strings.
 
  Trying to take a minimalist approach, we could add an optional parameter
 on the database open call that indicates the language to use (e.g. en or
 en-UK, etc.). If the language is not specified and the database does not
 exist, then we can use the current browser/OS language to create the
 database. If not specified and database already exists, then use the one
 that's already there (this accommodates the fact that a user may be able to
 change their default language in the browser/OS after the database has been
 created using the default). If the language is specified and the database
 already exists and the specified language is not the one the database has
 then we'll throw an exception (same behavior as with description, although
 we have that one in flight right now as well).
 
  We should probably also add a read-only attribute to the database object
 that exposes the language.


I think we should first break down the use cases and look at how many of
them just need _a_ sort order, for how many a per-database sort order is
ok, and how many would need something finer grained (like a per-key
ordering).

Are there work-arounds for getting a UCA-ordered data structure to hold
data in another language's order?  For example, I could imagine it'd be possible
to do some sort of encode step on the data before insertion (and decode on
removal) that would make UCA work.  I have no idea, but if such algorithms
existed and were well understood, then it'd definitely make me lean towards
punting language specification to v2.
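
One way to read the encode/decode idea is a precomputed sort key. The sketch below is hypothetical throughout (`encodeKey`, `decodeKey` and the toy alphabet are not from any spec or library) and assumes the store compares keys by plain code-unit order. It only handles the "ch" rule discussed in this thread, and the application still has to know the language when it encodes.

```typescript
// Hypothetical helpers; nothing here is part of IndexedDB or of the UCA.
const LETTERS: string[] = "abcdefghijklmnopqrstuvwxyz".split("");
// Assumption: same toy tailoring as in the thread, "ch" is one letter after "c".
const SPANISH_LETTERS: string[] = LETTERS.slice(0, 3).concat(["ch"], LETTERS.slice(3));

// Sorts below every character used in a sort key. Toy limitation: the input
// words must not contain this character themselves.
const SEPARATOR = "\u0000";

// Encode: prefix the word with a sort key whose code-unit order follows the alphabet.
function encodeKey(word: string): string {
  const lower = word.toLowerCase();
  let sortKey = "";
  let i = 0;
  while (i < lower.length) {
    const pair = lower.slice(i, i + 2);
    const letter =
      pair.length === 2 && SPANISH_LETTERS.indexOf(pair) !== -1 ? pair : lower.charAt(i);
    const rank = SPANISH_LETTERS.indexOf(letter);
    // Known letters map to "A" + rank; anything else is shifted above them.
    const code = rank === -1 ? Math.min(0xffff, 0x100 + letter.charCodeAt(0)) : 0x41 + rank;
    sortKey += String.fromCharCode(code);
    i += letter.length;
  }
  return sortKey + SEPARATOR + word;
}

// Decode: drop the sort key and return the original word.
function decodeKey(key: string): string {
  return key.slice(key.indexOf(SEPARATOR) + 1);
}

// Array.prototype.sort() without a comparator orders strings by code unit,
// standing in here for a store with binary key comparison.
const storedKeys = ["chico", "dedo", "cuando", "cosa"].map(encodeKey).sort();
const decodedOrder = storedKeys.map(decodeKey);
```

Here `decodedOrder` comes out as cosa, cuando, chico, dedo. The cost is that every key is stored twice over and range queries must be made against encoded bounds.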

J


 
  If this works for folks I can write a proposal for the specific changes
 to the spec.
 
  Thanks
  -pablo






Re: [IndexedDB] Languages for collation

2010-08-12 Thread Jonas Sicking
On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro
pablo.cas...@microsoft.com wrote:
 We had some discussions about collation algorithms and such in the past, but 
 I don't think we have settled on the language aspect of it. In order to have 
 stores and indexes sort character-based keys in a way that is consistent with 
 users' expectations we'll have to take indication in the API of what language 
 we should use to collate strings.

 Trying to take a minimalist approach, we could add an optional parameter on 
 the database open call that indicates the language to use (e.g. en or 
 en-UK, etc.). If the language is not specified and the database does not 
 exist, then we can use the current browser/OS language to create the 
 database. If not specified and database already exists, then use the one that's 
 already there (this accommodates the fact that a user may be able to change 
 their default language in the browser/OS after the database has been created 
 using the default). If the language is specified and the database already 
 exists and the specified language is not the one the database has then we'll 
 throw an exception (same behavior as with description, although we have 
 that one in flight right now as well).

 We should probably also add a read-only attribute to the database object that 
 exposes the language.

 If this works for folks I can write a proposal for the specific changes to 
 the spec.

If we make it part of the database open call, then that makes it
impossible to change the sorting order of an existing database, no?
This seems like it could be a problem. I.e. it is quite possible that an
application will want to allow the user to change the sorting
language, for example when changing the language of the UI.

One solution would be to allow language to be set as part of the
setVersion call.

/ Jonas
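
For reference, the open-time rules in the proposal quoted above can be written as a small decision function. This is a sketch of the proposal as worded in this thread, not of any shipped API; `resolveLanguage` and its parameters are hypothetical names.

```typescript
// Hypothetical model of the proposed open() behavior; not part of any spec.
// existing:  language stored with the database, or undefined if it does not exist yet
// requested: language passed to open(), or undefined if the caller omitted it
// uaDefault: current browser/OS language
function resolveLanguage(
  existing: string | undefined,
  requested: string | undefined,
  uaDefault: string
): string {
  if (existing === undefined) {
    // New database: use the requested language, else the browser/OS default.
    return requested !== undefined ? requested : uaDefault;
  }
  if (requested === undefined || requested === existing) {
    // Existing database: keep its language even if the default has changed since.
    return existing;
  }
  // Existing database opened with a different language: the proposal throws.
  throw new Error('database uses "' + existing + '", not "' + requested + '"');
}
```

Under these rules a mismatch always throws, so the only way to get a different language is a separate path such as the setVersion idea above.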



Re: [IndexedDB] Languages for collation

2010-08-12 Thread Jeremy Orlow
On Thu, Aug 12, 2010 at 11:19 AM, Jonas Sicking jo...@sicking.cc wrote:

 On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro
 pablo.cas...@microsoft.com wrote:
  We had some discussions about collation algorithms and such in the past,
 but I don't think we have settled on the language aspect of it. In order to
 have stores and indexes sort character-based keys in a way that is
 consistent with users' expectations we'll have to take indication in the API
 of what language we should use to collate strings.
 
  Trying to take a minimalist approach, we could add an optional parameter
 on the database open call that indicates the language to use (e.g. en or
 en-UK, etc.). If the language is not specified and the database does not
 exist, then we can use the current browser/OS language to create the
 database. If not specified and database already exists, then use the one
 that's already there (this accommodates the fact that a user may be able to
 change their default language in the browser/OS after the database has been
 created using the default). If the language is specified and the database
 already exists and the specified language is not the one the database has
 then we'll throw an exception (same behavior as with description, although
 we have that one in flight right now as well).
 
  We should probably also add a read-only attribute to the database object
 that exposes the language.
 
  If this works for folks I can write a proposal for the specific changes
 to the spec.

 If we make it part of the database open call, then that makes it
 impossible to change the sorting order of an existing database, no?
 This seems like it could be a problem. I.e. it is quite possible that an
 application will want to allow the user to change the sorting
 language, for example when changing the language of the UI.

 One solution would be to allow language to be set as part of the
 setVersion call.


Whether it's per-database or more fine grained I think it absolutely must be
part of setVersion.  Changing the language will be a very heavyweight
operation that'll require a similar level of isolation to schema changes
of the database.  (Not sure how I missed this point of Pablo's original
email.)

J


RE: [IndexedDB] Languages for collation

2010-08-12 Thread Pablo Castro

From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy Orlow
Sent: Thursday, August 12, 2010 2:18 AM

 I think we should first break down the use cases and look at how many of 
 them just need _a_ sort order, for how many a per-database sort order is 
 ok, and how many would need something finer grained (like a per-key 
 ordering).

That's reasonable. What I was thinking is that any case where you'll use the 
order of items in a store/index to display things to the user (e.g. a list of 
contacts) you'd want the items to be in proper order for the user's language. 
That will not only match users' expectations but also match other applications 
(or even other parts of the UA) that display data based on the current OS 
language or the users' choice of language. 

That covers a very broad spectrum of scenarios that need language-specific sort 
order. 

I find it unlikely that a single web app will need more than one language per 
database (or even per origin/OS account), given that most applications operate 
in a single language at any one point in time. 

 Are there work-arounds for getting a UCA-ordered data structure to hold 
 data in another language's order?  For example, I could imagine it'd be possible 
 to do some sort of encode step on the data before insertion (and decode on 
 removal) that would make UCA work.  I have no idea, but if such algorithms 
 existed and were well understood, then it'd definitely make me lean towards 
 punting language specification to v2.

I'm not sure I understand this paragraph. "UCA ordered" may not mean much more 
than just ordering using a binary collation if the language is not specified. 
While this is typically not an issue in English, in other languages this 
introduces a varying level of deviation from users' expectations. Given that 
different languages have conflicting rules for collation, I'm not sure how this 
can be generalized independently of the language. Even in the UCA specification 
[1] the aspect of input language is mentioned as the most important feature of 
collation.

[1] http://www.unicode.org/reports/tr10/
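
As a small illustration of how far a binary ordering can be from what users expect, the sketch below sorts the same three names by UTF-16 code unit and with a language-aware comparator. `Intl.Collator` is the standard ECMAScript internationalization API, used here only as a stand-in for whatever collation an implementation might pick; the sample names are made up.

```typescript
// Sample data is illustrative; "\u00C9" is a precomposed capital E with acute.
const names = ["Zebra", "apple", "\u00C9clair"];

// sort() without a comparator orders strings by UTF-16 code unit, so every
// uppercase ASCII letter precedes every lowercase one and accented letters
// land after both.
const binaryOrder = names.slice().sort();

// A language-aware comparator for English.
const english = new Intl.Collator("en");
const englishOrder = names.slice().sort(english.compare);
```

`binaryOrder` is Zebra, apple, Éclair, while `englishOrder` is apple, Éclair, Zebra. Other languages would need their own collator, which is the point about conflicting rules made above.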




RE: [IndexedDB] Languages for collation

2010-08-12 Thread Pablo Castro

From: jor...@google.com [mailto:jor...@google.com] On Behalf Of Jeremy Orlow
Sent: Thursday, August 12, 2010 3:36 AM

 On Thu, Aug 12, 2010 at 11:19 AM, Jonas Sicking jo...@sicking.cc wrote:
 On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro
 pablo.cas...@microsoft.com wrote:
  We had some discussions about collation algorithms and such in the past, 
  but I don't think we have settled on the language aspect of it. In order 
  to have stores and indexes sort character-based keys in a way that is 
  consistent with users' expectations we'll have to take indication in the 
  API of what language we should use to collate strings.
 
  Trying to take a minimalist approach, we could add an optional parameter 
  on the database open call that indicates the language to use (e.g. en or 
  en-UK, etc.). If the language is not specified and the database does not 
  exist, then we can use the current browser/OS language to create the 
  database. If not specified and database already exists, then use the one 
  that's already there (this accommodates the fact that a user may be able to 
  change their default language in the browser/OS after the database has 
  been created using the default). If the language is specified and the 
  database already exists and the specified language is not the one the 
  database has then we'll throw an exception (same behavior as with 
  description, although we have that one in flight right now as well).
 
  We should probably also add a read-only attribute to the database object 
  that exposes the language.
 
  If this works for folks I can write a proposal for the specific changes to 
  the spec.
 If we make it part of the database open call, then that makes it
 impossible to change the sorting order of an existing database, no?
 This seems like it could be a problem. I.e. it is quite possible that an
 application will want to allow the user to change the sorting
 language, for example when changing the language of the UI.

 One solution would be to allow language to be set as part of the
 setVersion call.

 Whether it's per-database or more fine grained I think it absolutely must be 
 part of setVersion.  Changing the language will be a very heavyweight 
 operation that'll require a similar level of isolation to schema changes 
 of the database.  (Not sure how I missed this point of Pablo's original 
 email.)

Yes, changing the collation would effectively mean re-creating all the stores 
and indexes. At a very minimum it needs to be a setVersion thing. I also don't 
think it would be too crazy to not support changing collations at all. In the 
unusual case where a user absolutely must do this, it can be done by creating a 
separate database and copying the data over using the APIs.
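
To show why this amounts to re-creating the stores, here is a minimal in-memory sketch of the copy-over workaround. `SortedStore` and `recollate` are hypothetical stand-ins rather than IndexedDB interfaces, and the two comparators are simple placeholders for real collations.

```typescript
type Compare = (a: string, b: string) => number;

// Hypothetical in-memory stand-in for an object store with sorted string keys.
class SortedStore {
  private readonly entries: string[] = [];
  constructor(private readonly compare: Compare) {}

  put(key: string): void {
    // Insert before the first existing key that sorts after the new one.
    let at = 0;
    while (at < this.entries.length && this.compare(this.entries[at], key) <= 0) at++;
    this.entries.splice(at, 0, key);
  }

  keys(): string[] {
    return this.entries.slice();
  }
}

// Changing the collation: build a new store and copy every key across,
// because each key may need a different position under the new ordering.
function recollate(source: SortedStore, compare: Compare): SortedStore {
  const target = new SortedStore(compare);
  for (const key of source.keys()) target.put(key);
  return target;
}

// Placeholder collations: plain code-unit order and a case-insensitive variant.
const byCodeUnit: Compare = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
const ignoringCase: Compare = (a, b) => byCodeUnit(a.toLowerCase(), b.toLowerCase());

const original = new SortedStore(byCodeUnit);
["banana", "Cherry", "apple"].forEach((key) => original.put(key));
const copy = recollate(original, ignoringCase);
```

`original.keys()` is Cherry, apple, banana and `copy.keys()` is apple, banana, Cherry. Every key is touched during the copy, which is why the operation is heavyweight whichever API exposes it.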