RE: [IndexedDB] Closing on bug 9903 (collations)

2011-06-17 Thread Pablo Castro

From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On 
Behalf Of Keean Schupke
Sent: Tuesday, May 31, 2011 11:51 PM

 On 1 June 2011 01:37, Pablo Castro pablo.cas...@microsoft.com wrote:

 -Original Message-
 From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of Aryeh 
 Gregor
 Sent: Tuesday, May 31, 2011 3:49 PM

  On Tue, May 31, 2011 at 6:39 PM, Pablo Castro
  pablo.cas...@microsoft.com wrote:
   No, that was poor wording on my part, I keep using locale in the 
   wrong context. I meant to have the API take a proper collation 
   identifier. The identifier can be as specific as the caller wants it to 
   be. The implementation could choose to not honor some specific detail 
   if it can't handle it (to the extent that doing so is allowed by the 
   specification of collation names), or fail because it considers that 
   not handling a particular aspect of the collation identifier would 
   severely deviate from the caller's expectations.
 
  I'm not sure I understand you.  My personal opinion is that there
  should be no undefined behavior here.  If authors are allowed to pass
  collation identifiers, the spec needs to say exactly how they're to be
  interpreted, so the same identifier passed to two different browsers
  will result in the same collation, i.e., the same strings need to sort
  the same cross-browser.  Having only binary collation is better than
  having non-binary collations but not defining them, IMO.
 I thought BCP47 allowed implementations to drop subtags if needed. I just 
 re-read the spec and it seems that it only allows to do that in constrained 
 cases where you can't fit the whole name in your buffer (which wouldn't 
 apply to the context discussed here). My first instinct is that this is 
 quite a bit to guarantee (full consistency in collation), but it seems that 
 that's what the spec is shooting for.

   Given the amount of debate on this, could we at least agree that we can 
   do binary for v1? We can then have an open item for v2 on taking 
   collation names and sort according to UCA or taking callbacks and such.
 
  I'm okay with supporting only binary to start with.
 Great. I'll still wait a bit to see what other folks think, and then update 
 the bug one way or the other.

 Thanks
 -pablo

 The discussion sounds like it is headed in the right direction. Are there 
 any issues with non-unicode encodings that need to be dealt with (HTTP 
 headers default to ISO-8859 I think). Would people be expected to convert on 
 read into UTF-16 strings or use typed-arrays?

I asked around here, and folks pointed out that the JavaScript spec seems to 
describe exactly what we need. In [1], section 11.8.5, the relevant fragment, 
starting at step 4, reads:

Else, both px and py are Strings
a. If py is a prefix of px, return false. (A String value p is a prefix of 
String value q if q can be the result of concatenating p and some other String 
r. Note that any String is a prefix of itself, because r may be the empty 
String.)
b. If px is a prefix of py, return true.
c. Let k be the smallest nonnegative integer such that the character at 
position k within px is different from the character at position k within py. 
(There must be such a k, for neither String is a prefix of the other.)
d. Let m be the integer that is the code unit value for the character at 
position k within px.
e. Let n be the integer that is the code unit value for the character at 
position k within py.
f. If m < n, return true. Otherwise, return false.

It also has a note below indicating:

NOTE 2 The comparison of Strings uses a simple lexicographic ordering on 
sequences of code unit values. There is no attempt to use the more complex, 
semantically oriented definitions of character or string equality and collating 
order defined in the Unicode specification. Therefore String values that are 
canonically equal according to the Unicode standard could test as unequal. In 
effect this algorithm assumes that both Strings are already in normalised form. 
Also, note that for strings containing supplementary characters, lexicographic 
ordering on sequences of UTF-16 code unit values differs from that on sequences 
of code point values.

Which is very much in line with what we've been discussing, and has the extra 
feature of being compatible with JavaScript order. 

So it looks like we could reference (or inline) this in the spec and have a 
fully specified order for keys with string content.
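
For illustration, the quoted steps can be sketched in TypeScript as follows. 
This is a minimal sketch rather than spec text: the function name 
lessThanByCodeUnits and the sample keys are made up for the example.

```typescript
// Sketch of the quoted steps a-f for two String operands.
// Returns true when px sorts strictly before py.
function lessThanByCodeUnits(px: string, py: string): boolean {
  // a. If py is a prefix of px, return false (this also covers px === py).
  if (px.slice(0, py.length) === py) return false;
  // b. If px is a prefix of py, return true.
  if (py.slice(0, px.length) === px) return true;
  // c. Find the smallest k at which the two strings differ.
  let k = 0;
  while (px.charCodeAt(k) === py.charCodeAt(k)) k++;
  // d-f. Compare the UTF-16 code unit values at position k.
  return px.charCodeAt(k) < py.charCodeAt(k);
}

// Sorting a few sample keys with this ordering.
const sampleKeys = ["b", "a", "ab", "B", ""];
const sortedSample = sampleKeys.slice().sort((x, y) =>
  lessThanByCodeUnits(x, y) ? -1 : lessThanByCodeUnits(y, x) ? 1 : 0
);
// sortedSample: ["", "B", "a", "ab", "b"]

// Supplementary characters: U+1F600 is stored as the surrogate pair
// D83D DE00, so it sorts before U+FF5E by code unit even though its
// code point is larger.
const emojiFirst = lessThanByCodeUnits("\uD83D\uDE00", "\uFF5E"); // true
```

The last line shows the case the note calls out: code unit order and code 
point order disagree once surrogate pairs are involved.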

Thoughts? 

Thanks
-pablo

[1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf





Re: [IndexedDB] Closing on bug 9903 (collations)

2011-06-17 Thread Jonas Sicking
On Fri, Jun 17, 2011 at 11:43 AM, Pablo Castro
pablo.cas...@microsoft.com wrote:

 From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On 
 Behalf Of Keean Schupke
 Sent: Tuesday, May 31, 2011 11:51 PM

 On 1 June 2011 01:37, Pablo Castro pablo.cas...@microsoft.com wrote:

 -Original Message-
 From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of Aryeh 
 Gregor
 Sent: Tuesday, May 31, 2011 3:49 PM

  On Tue, May 31, 2011 at 6:39 PM, Pablo Castro
  pablo.cas...@microsoft.com wrote:
   No, that was poor wording on my part, I keep using locale in the 
   wrong context. I meant to have the API take a proper collation 
   identifier. The identifier can be as specific as the caller wants it 
   to be. The implementation could choose to not honor some specific 
   detail if it can't handle it (to the extent that doing so is allowed 
   by the specification of collation names), or fail because it considers 
   that not handling a particular aspect of the collation identifier 
   would severely deviate from the caller's expectations.
 
  I'm not sure I understand you.  My personal opinion is that there
  should be no undefined behavior here.  If authors are allowed to pass
  collation identifiers, the spec needs to say exactly how they're to be
  interpreted, so the same identifier passed to two different browsers
  will result in the same collation, i.e., the same strings need to sort
  the same cross-browser.  Having only binary collation is better than
  having non-binary collations but not defining them, IMO.
 I thought BCP47 allowed implementations to drop subtags if needed. I just 
 re-read the spec and it seems that it only allows to do that in constrained 
 cases where you can't fit the whole name in your buffer (which wouldn't 
 apply to the context discussed here). My first instinct is that this is 
 quite a bit to guarantee (full consistency in collation), but it seems that 
 that's what the spec is shooting for.

   Given the amount of debate on this, could we at least agree that we 
   can do binary for v1? We can then have an open item for v2 on taking 
   collation names and sort according to UCA or taking callbacks and such.
 
  I'm okay with supporting only binary to start with.
 Great. I'll still wait a bit to see what other folks think, and then update 
 the bug one way or the other.

 Thanks
 -pablo

 The discussion sounds like it is headed in the right direction. Are there 
 any issues with non-unicode encodings that need to be dealt with (HTTP 
 headers default to ISO-8859 I think). Would people be expected to convert 
 on read into UTF-16 strings or use typed-arrays?

 I asked around here and folks actually pointed out that the JavaScript spec 
 seems to be describing exactly what we needed. Looking at here [1], section 
 11.8.5, the relevant fragment starting at step 4 goes:

 Else, both px and py are Strings
    a. If py is a prefix of px, return false. (A String value p is a prefix of 
 String value q if q can be the result of concatenating p and some other 
 String r. Note that any String is a prefix of itself, because r may be the 
 empty String.)
    b. If px is a prefix of py, return true.
    c. Let k be the smallest nonnegative integer such that the character at 
 position k within px is different from the character at position k within py. 
 (There must be such a k, for neither String is a prefix of the other.)
    d. Let m be the integer that is the code unit value for the character at 
 position k within px.
    e. Let n be the integer that is the code unit value for the character at 
 position k within py.
    f. If m < n, return true. Otherwise, return false.

 It also has a note below indicating:

 NOTE 2 The comparison of Strings uses a simple lexicographic ordering on 
 sequences of code unit values. There is no attempt to use the more complex, 
 semantically oriented definitions of character or string equality and 
 collating order defined in the Unicode specification. Therefore String values 
 that are canonically equal according to the Unicode standard could test as 
 unequal. In effect this algorithm assumes that both Strings are already in 
 normalised form. Also, note that for strings containing supplementary 
 characters, lexicographic ordering on sequences of UTF-16 code unit values 
 differs from that on sequences of code point values.

 Which is very much in line with what we've been discussing, and has the extra 
 feature of being compatible with JavaScript order.

 So it looks like we could reference (or inline) this in the spec and have a 
 fully specified order for keys with string content.

 Thoughts?

Sounds great! Thanks for doing the research here!

/ Jonas



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-06-01 Thread Keean Schupke
On 1 June 2011 01:37, Pablo Castro pablo.cas...@microsoft.com wrote:


 -Original Message-
 From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of
 Aryeh Gregor
 Sent: Tuesday, May 31, 2011 3:49 PM

  On Tue, May 31, 2011 at 6:39 PM, Pablo Castro
  pablo.cas...@microsoft.com wrote:
   No, that was poor wording on my part, I keep using locale in the
 wrong context. I meant to have the API take a proper collation identifier.
 The identifier can be as specific as the caller wants it to be. The
 implementation could choose to not honor some specific detail if it can't
 handle it (to the extent that doing so is allowed by the specification of
 collation names), or fail because it considers that not handling a
 particular aspect of the collation identifier would severely deviate from
 the caller's expectations.
 
  I'm not sure I understand you.  My personal opinion is that there
  should be no undefined behavior here.  If authors are allowed to pass
  collation identifiers, the spec needs to say exactly how they're to be
  interpreted, so the same identifier passed to two different browsers
  will result in the same collation, i.e., the same strings need to sort
  the same cross-browser.  Having only binary collation is better than
  having non-binary collations but not defining them, IMO.

 I thought BCP47 allowed implementations to drop subtags if needed. I just
 re-read the spec and it seems that it only allows to do that in constrained
 cases where you can't fit the whole name in your buffer (which wouldn't
 apply to the context discussed here). My first instinct is that this is
 quite a bit to guarantee (full consistency in collation), but it seems that
 that's what the spec is shooting for.

   Given the amount of debate on this, could we at least agree that we
 can do binary for v1? We can then have an open item for v2 on taking
 collation names and sort according to UCA or taking callbacks and such.
 
  I'm okay with supporting only binary to start with.

 Great. I'll still wait a bit to see what other folks think, and then update
 the bug one way or the other.

 Thanks
 -pablo


The discussion sounds like it is headed in the right direction. Are there
any issues with non-Unicode encodings that need to be dealt with? (HTTP
headers default to ISO-8859, I think.) Would people be expected to convert
on read into UTF-16 strings, or use typed arrays?


Cheers,
Keean.


RE: [IndexedDB] Closing on bug 9903 (collations)

2011-05-31 Thread Pablo Castro

-----Original Message-----
From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of Aryeh 
Gregor
Sent: Friday, May 06, 2011 10:05 AM


 On Fri, May 6, 2011 at 5:18 AM, Jonas Sicking jo...@sicking.cc wrote:
  Based on that, my conclusion is that we should go with what Pablo is
  proposing. And I think we should do it for v1.

 If I understand correctly, Pablo's proposal is that the author be
 allowed to specify a locale, and the browser can collate in some
 undefined way based on that locale.  That sounds like a really bad
 idea for interop.  If non-binary collation is supported in a first
 version, it should be either

No, that was poor wording on my part, I keep using locale in the wrong 
context. I meant to have the API take a proper collation identifier. The 
identifier can be as specific as the caller wants it to be. The implementation 
could choose to not honor some specific detail if it can't handle it (to the 
extent that doing so is allowed by the specification of collation names), or 
fail because it considers that not handling a particular aspect of the 
collation identifier would severely deviate from the caller's expectations.

 1) Two choices, binary or UCA 6.0.0.  (AFAIK, UCA gives fairly good
 results for most languages even without tailoring, so it might be just
 fine for v1.  It's vastly better than binary, for sure.)

Given the amount of debate on this, could we at least agree that we can do 
binary for v1? We can then have an open item for v2 on taking collation names 
and sort according to UCA or taking callbacks and such.

 2) In addition to binary and UCA 6.0.0, allow UCA 6.0.0 tailored by
 any of the locales defined by CLDR 1.9.1.

 There also needs to be some thought put into how to handle version
 updates, since browsers cannot update their UCA or CLDR implementation
 without rebuilding all existing indexes that used it (unless they keep
 the old implementation forever).  It might be that browsers should
 just stick to a fixed version for the time being (like 6.0.0 and
 1.9.1), and we might decide that no further APIs are needed now to
 accommodate possible future switches, but at least some thought needs
 to be given to it.

I wonder if the API (independently of when we get to this) should include the 
version either as part of the collation identifier or as a separate argument. 
This would allow UAs to support a version or two for a while, and then phase 
them out as they fall out of use in favor of newer ones.
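
To make the idea of a version travelling with the collation identifier more 
concrete, here is a rough sketch. Everything in it is hypothetical: the 
CollationRequest shape, the supportedVersions table, and the function name are 
invented for illustration and are not part of any IndexedDB draft.

```typescript
// Hypothetical shape for a versioned collation request.
interface CollationRequest {
  collation: string; // e.g. "binary" or "uca"
  version?: string;  // e.g. "6.0.0"; omitted means "newest supported"
}

// Versions a hypothetical user agent still ships, newest first.
// Older entries can be dropped as they fall out of use.
const supportedVersions: { [collation: string]: string[] } = {
  binary: [""],
  uca: ["6.0.0", "5.2.0"],
};

// Returns the version to use, or null when the request cannot be honored.
function resolveCollationVersion(req: CollationRequest): string | null {
  if (!Object.prototype.hasOwnProperty.call(supportedVersions, req.collation)) {
    return null;
  }
  const versions = supportedVersions[req.collation];
  if (req.version === undefined) return versions[0];
  return versions.indexOf(req.version) >= 0 ? req.version : null;
}

resolveCollationVersion({ collation: "uca" });                   // "6.0.0"
resolveCollationVersion({ collation: "uca", version: "5.2.0" }); // "5.2.0"
resolveCollationVersion({ collation: "uca", version: "4.0.0" }); // null
```

Failing with null rather than silently substituting another version keeps an 
existing index from being read with a collation it was not built with.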

 On consideration, I don't think user-specified sortkey functions are
 necessary at this stage.  If collations are to be identified by
 strings for now, we could always overload the value to accept a
 function at some later date if we wanted to support that.  So I
 wouldn't worry about that further.

I agree.

-pablo



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-31 Thread Aryeh Gregor
On Tue, May 31, 2011 at 6:39 PM, Pablo Castro
pablo.cas...@microsoft.com wrote:
 No, that was poor wording on my part, I keep using locale in the wrong 
 context. I meant to have the API take a proper collation identifier. The 
 identifier can be as specific as the caller wants it to be. The 
 implementation could choose to not honor some specific detail if it can't 
 handle it (to the extent that doing so is allowed by the specification of 
 collation names), or fail because it considers that not handling a particular 
 aspect of the collation identifier would severely deviate from the caller's 
 expectations.

I'm not sure I understand you.  My personal opinion is that there
should be no undefined behavior here.  If authors are allowed to pass
collation identifiers, the spec needs to say exactly how they're to be
interpreted, so the same identifier passed to two different browsers
will result in the same collation, i.e., the same strings need to sort
the same cross-browser.  Having only binary collation is better than
having non-binary collations but not defining them, IMO.

 Given the amount of debate on this, could we at least agree that we can do 
 binary for v1? We can then have an open item for v2 on taking collation names 
 and sort according to UCA or taking callbacks and such.

I'm okay with supporting only binary to start with.
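
As a small illustration of what "binary only" means in practice, the sketch 
below sorts the same words by code unit value and with a language-aware 
collator. Intl.Collator is used purely as a stand-in for a non-binary 
collation; it is not something proposed in this thread, and its exact output 
depends on the collation data the engine ships. The word list and names are 
made up.

```typescript
// Binary collation: order by UTF-16 code unit values, which is what the
// relational operators do when both operands are strings.
function binaryCompare(x: string, y: string): number {
  return x < y ? -1 : x > y ? 1 : 0;
}

const sampleWords = ["zebra", "apple", "Zoo", "Banana"];

// Uppercase ASCII letters have smaller code units than lowercase ones,
// so capitalised words come first.
const binaryOrder = sampleWords.slice().sort(binaryCompare);
// binaryOrder: ["Banana", "Zoo", "apple", "zebra"]

// Language-aware order from the engine's built-in collator. The result is
// implementation-dependent, which is the interop concern raised above.
const englishCollator = new Intl.Collator("en");
const linguisticOrder = sampleWords.slice().sort(englishCollator.compare);
```

The binary result is fully determined by the strings themselves; the 
linguistic result is only as consistent across browsers as their collation 
data.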



RE: [IndexedDB] Closing on bug 9903 (collations)

2011-05-31 Thread Pablo Castro

-----Original Message-----
From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of Aryeh 
Gregor
Sent: Tuesday, May 31, 2011 3:49 PM

 On Tue, May 31, 2011 at 6:39 PM, Pablo Castro
 pablo.cas...@microsoft.com wrote:
  No, that was poor wording on my part, I keep using locale in the wrong 
  context. I meant to have the API take a proper collation identifier. The 
  identifier can be as specific as the caller wants it to be. The 
  implementation could choose to not honor some specific detail if it can't 
  handle it (to the extent that doing so is allowed by the specification of 
  collation names), or fail because it considers that not handling a 
  particular aspect of the collation identifier would severely deviate from 
  the caller's expectations.

 I'm not sure I understand you.  My personal opinion is that there
 should be no undefined behavior here.  If authors are allowed to pass
 collation identifiers, the spec needs to say exactly how they're to be
 interpreted, so the same identifier passed to two different browsers
 will result in the same collation, i.e., the same strings need to sort
 the same cross-browser.  Having only binary collation is better than
 having non-binary collations but not defining them, IMO.

I thought BCP47 allowed implementations to drop subtags if needed. I just 
re-read the spec, and it seems to allow that only in constrained 
cases where you can't fit the whole name in your buffer (which wouldn't apply 
to the context discussed here). My first instinct is that this is quite a bit 
to guarantee (full consistency in collation), but it seems that that's what the 
spec is shooting for. 

  Given the amount of debate on this, could we at least agree that we can do 
  binary for v1? We can then have an open item for v2 on taking collation 
  names and sort according to UCA or taking callbacks and such.

 I'm okay with supporting only binary to start with.

Great. I'll still wait a bit to see what other folks think, and then update the 
bug one way or the other.

Thanks
-pablo



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-09 Thread Shawn Wilsher

On 5/6/2011 7:07 AM, timeless wrote:

I think that a stored procedure could be considered as a compiled
version of a serialized function. i.e. something which loses its scope
chain, and which loses access to its parent object. If it loses access
to its scope chain which includes the interesting globals, it will no
longer have access to fun things like DOM objects, roughly like
DOMWorkers but with even less exciting objects available. I'd hope
that a jit should be able to do a fairly reasonable job of optimizing
such a function given these constraints.

This may be what we go with, but not in version 1.

Cheers,

Shawn





Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-06 Thread Keean Schupke
On 6 May 2011 03:00, Jonas Sicking jo...@sicking.cc wrote:

 On Wed, May 4, 2011 at 11:12 PM, Keean Schupke ke...@fry-it.com wrote:
  On 5 May 2011 00:33, Aryeh Gregor simetrical+...@gmail.com wrote:
 
  On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking jo...@sicking.cc wrote:
   I don't think we should do callbacks for the first version of
   javascript. It gets very messy since we can't rely on that the script
   function will be returning stable values.
 
  The worst that would happen if it didn't return stable values is that
  sorting would return unpredictable results.
 
  Worst is an infinite loop - no return.
 
 
   So the choice here really is between only supporting some form of
   binary sorting, or supporting a built-in set of collations. Anything
   else will have to wait for version 2 in my opinion.
 
  I think it would be a mistake to try supporting a limited set of
  natural-language collations.  Binary collation is fine for a first
  version.  MySQL only supported binary collation up through version 4,
  for instance.
 
  A good point about MySQL.
 
 
  On Wed, May 4, 2011 at 3:49 AM, Keean Schupke ke...@fry-it.com wrote:
   I thought only the app that created the db could open it (for security
   reasons)... so it becomes the app's responsibility to do version
   control. The comparison function is not going to change by itself -
   someone has to go into the code and change it; when they do that they
   should up the revision of the database, if that change is incompatible.
 
  Why should we let such a pitfall exist if we can just store the
  function and avoid the issue?
 
  I don't see it as a pitfall; it has the advantage of transparency.
 
 
   There is exactly the same problem with object properties. If the app
   changes
   to expect a new property on all objects stored, then the app has to
   correctly deal with the update.
 
  If a requested property doesn't exist, I assume the API will fail
  immediately with a clear error code.  It will not fail silently and
  mysteriously with no error code.  (Again, I haven't looked at it
  closely, or tried to use it.)
 
  What if the new version uses the same property name for a different
  thing? For example, in V1 'Employer' is a string name, and in V2
  'Employer' is a reference to another object. You may say 'you should
  change the column name'? Right, that's just the same as me saying you
  should change the DB version number when you change the collation
  algorithm. It's the same thing.
  People seem to be making a big fuss about having a non-persisted
  collation function defined in user code, when many, many things require
  the code to have the correct model of the data stored in the database to
  work properly. It seems illogical to make a special case for this
  function, and not do anything about all the other cases. IMHO either the
  database should have a stored schema, or it should not. If IndexedDB is
  going the direction of not having a stored schema, then the designers
  should have the confidence in their decision to stick with it and at
  least produce something with a consistent approach to the problem.
 
 
   2) making things easy for the user - for me a simpler more predictable
   API is better for the user. Having a function stored inside the database
   is bad, because you cannot see what function might be stored in there...
 
  We could let you query the stored function.
 
  Why would you need to read it? Every time you open the database you would
  need to check the function is the one you expect. The code would have to
  contain the function so it can compare it with the one in the DB and
  update it if necessary. If the code contains the function, there are two
  copies of the function, one in the database and one in the code. Which
  one is correct? Which one is it using? So sometimes you will write the
  new function to the database, and sometimes you will not? More paths to
  test in code coverage, more complexity. It's simpler to just always set
  the function when opening the database.
 
 
   it might be
   a function from a previous version of the code and cause all sorts of
   strange bugs (which will only affect certain users with a certain
   version of
   the function stored in their DB).
 
  It will cause *much* less strange bugs than if you have one index that
  used two different collations, which is the alternative possibility.
  If the function is stored, the worst case will be that the collation
  function is out of date.  In practice, authors will mostly want to use
  established collation functions like UCA and won't mind if they're out
  of date.  They'll also only very rarely have occasion to deliberately
  change the function.
 
  As I said, you will end up querying the function to see if it is the one
 you
  want to use, if you do that you may as well set it every time.
  Thinking about this a bit more. If you change the collation function you
  need to re-sort the 

Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-06 Thread Keean Schupke
On 6 May 2011 00:22, Aryeh Gregor simetrical+...@gmail.com wrote:

 On Thu, May 5, 2011 at 2:12 AM, Keean Schupke ke...@fry-it.com wrote:
  What if the new version uses the same property name for a different
 thing?

 Yes, obviously it's going to be possible for code changes to cause
 hard-to-catch bugs due to not updating the database correctly.  We
 don't have to add more cases where that's possible than necessary,
 without good reason.  Maybe there's good reason here, but the added
 potential for error can't be neglected as a cost.


I have seen many bugs in real databases due to stored procedures.



  Why would you need to read it. Every time you open the database you would
  need to check the function is the one you expect.

 Not if you never intend to change it, or don't care if it's outdated.
 I expect this to be the most common case.


People don't change the language setting in an application?



 Consider the case of someone using CLDR-tailored UCA and a new version
 comes out.  You want to use the newest version for new indexes, if
 multiple versions are available, but there's no pressing need to
 automatically update existing indexes.  The old version is almost
 certainly good enough, unless your users use obscure languages.  So in
 my scheme, you can just update the function in your code and do
 nothing else.  In your scheme, you'd have to either stick to the old
 version across the board, or include both versions in your code
 indefinitely and include out-of-band logic to choose between them, or
 write a script that rebuilds the whole index on update (which would
 take a long time for a large index).


At least then the logic to choose between collations is visible in the code,
rather than hidden. This is all about transparency and making sure the
programmer has control of what is happening, rather than locking them into
limiting patterns, and giving them the ability to see exactly what the code
will do by reading and code-reviewing it.

With a stored procedure, what happens when a function you call (that is not
stored) changes?

The only way to be sure is to run a validation check on the index (run from
beginning to end checking the order is consistent with the comparison
function). That is the same whether you use stored procedures or not.
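
The validation pass described here could look roughly like the sketch below. 
It is only an illustration over an in-memory array of string keys; the function 
and variable names are made up, and checking adjacent pairs is sufficient only 
if the comparison function is a consistent (transitive) ordering.

```typescript
// Returns the position of the first key that is out of order relative to its
// predecessor, or -1 when the whole sequence is consistent with `compare`.
function firstOrderViolation(
  indexKeys: string[],
  compare: (x: string, y: string) => number
): number {
  for (let i = 1; i < indexKeys.length; i++) {
    if (compare(indexKeys[i - 1], indexKeys[i]) > 0) return i;
  }
  return -1;
}

const byCodeUnit = (x: string, y: string): number =>
  x < y ? -1 : x > y ? 1 : 0;
const caseInsensitive = (x: string, y: string): number =>
  byCodeUnit(x.toLowerCase(), y.toLowerCase());

// An index that was built with the binary comparison...
const storedIndex = ["Banana", "Zoo", "apple", "zebra"];
firstOrderViolation(storedIndex, byCodeUnit);      // -1: consistent
// ...is no longer in order once the comparison function changes.
firstOrderViolation(storedIndex, caseInsensitive); // 2: "apple" follows "Zoo"
```

The check is linear in the number of keys, so it is cheaper than a re-sort but 
still touches every entry in the index.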



  The code would have to contain the function so it can compare it with the
  one in the DB and update it if necessary. If the code contains the
  function, there are two copies of the function, one in the database and
  one in the code. Which one is correct? Which one is it using? So
  sometimes you will write the new function to the database, and sometimes
  you will not? More paths to test in code coverage, more complexity. It's
  simpler to just always set the function when opening the database.

 If the collation function is stored in the database, then I'd expect
 setting the function to rebuild the index if the new and old functions
 differ.  This could happen as a background operation, with the
 existing index still usable (with the old collation function) in the
 meantime.  So if you always wanted collations up-to-date, in my scheme
 authors could just set the function every time they open the database,
 as with your scheme.  But this could trigger a silent rebuild whenever
 necessary, so the author doesn't have to worry about it.  In your
 scheme, the author has to do the rebuild himself, and if he gets it
 wrong, the index will be corrupted.

 So as I see it, my approach is easier to use across the board.  It
 lets you not update collations on old tables without requiring you to
 keep track of multiple collation function versions, and it also
 potentially lets you update collations on old tables to the latest
 versions with rebuilding done for you in the background.  Critically,
 it does not let you change a sort function without rebuilding, since
 that will always cause bugs and you never want to do it (to a first
 approximation).

 Of course, maybe an initial implementation wouldn't do rebuilds for
 you, to keep it simple.  Then the collation function would be
 immutable after index creation, so you'd still have to do rebuilds
 yourself.  But it would still be easier and safer: the old index will
 still work in the interim even if you don't have the old version of
 your collation function around, and you can't mess up and get a
 corrupted index.
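
A toy model of the stored-function scheme described above, for illustration 
only. It keeps the keys in memory and treats the function's source text as 
what is "stored"; the class, its methods, and the use of toString() to detect 
a change are all assumptions made for this sketch, not a proposed API. Note 
that comparing source text would not notice a change in a helper that the 
collation function calls.

```typescript
type Comparator = (x: string, y: string) => number;

// In-memory stand-in for an index that persists its collation function.
class StoredCollationIndex {
  private sortedKeys: string[] = [];
  private storedSource: string;
  private compare: Comparator;
  rebuildCount = 0;

  constructor(compare: Comparator) {
    this.compare = compare;
    this.storedSource = compare.toString();
  }

  add(key: string): void {
    this.sortedKeys.push(key);
    this.sortedKeys.sort(this.compare);
  }

  // Intended to be called on every open; re-sorts only when the function
  // text differs from what was stored.
  setCollation(compare: Comparator): void {
    const source = compare.toString();
    if (source === this.storedSource) return;
    this.compare = compare;
    this.storedSource = source;
    this.sortedKeys.sort(compare);
    this.rebuildCount++;
  }

  keys(): string[] {
    return this.sortedKeys.slice();
  }
}

const binaryCmp: Comparator = (x, y) => (x < y ? -1 : x > y ? 1 : 0);
const foldedCmp: Comparator = (x, y) =>
  binaryCmp(x.toLowerCase(), y.toLowerCase());

const demoIndex = new StoredCollationIndex(binaryCmp);
demoIndex.add("apple");
demoIndex.add("Zoo");
demoIndex.add("Banana");
// demoIndex.keys(): ["Banana", "Zoo", "apple"]
demoIndex.setCollation(binaryCmp); // same function: no rebuild
demoIndex.setCollation(foldedCmp); // different function: one rebuild
// demoIndex.keys(): ["apple", "Banana", "Zoo"]
```

Here the rebuild is synchronous; the background rebuild suggested above would 
need the old ordering to stay usable until the new one is ready.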

  Thinking about this a bit more. If you change the collation function you
  need to re-sort the index to make sure it will work (and avoid those
  strange bugs). Storing the function in the DB enables you to compare the
  function and only change it when you need to, thus optimising the number
  of re-sorts. That is the _only_ advantage to storing the function - as
  you still need to check the function stored is the one you expect to
  guarantee your code will run properly. So with a non-persisted function
  we need to sort every time we open to make sure the order is correct.

 And 

Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-06 Thread Jonas Sicking
On Thu, May 5, 2011 at 11:36 PM, Keean Schupke ke...@fry-it.com wrote:
 On 6 May 2011 03:00, Jonas Sicking jo...@sicking.cc wrote:

 On Wed, May 4, 2011 at 11:12 PM, Keean Schupke ke...@fry-it.com wrote:
  On 5 May 2011 00:33, Aryeh Gregor simetrical+...@gmail.com wrote:
 
  On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking jo...@sicking.cc wrote:
   I don't think we should do callbacks for the first version of
   javascript. It gets very messy since we can't rely on that the script
   function will be returning stable values.
 
  The worst that would happen if it didn't return stable values is that
  sorting would return unpredictable results.
 
  Worst is an infinite loop - no return.
 
 
   So the choice here really is between only supporting some form of
   binary sorting, or supporting a built-in set of collations. Anything
   else will have to wait for version 2 in my opinion.
 
  I think it would be a mistake to try supporting a limited set of
  natural-language collations.  Binary collation is fine for a first
  version.  MySQL only supported binary collation up through version 4,
  for instance.
 
  A good point about MySQL.
 
 
  On Wed, May 4, 2011 at 3:49 AM, Keean Schupke ke...@fry-it.com wrote:
   I thought only the app that created the db could open it (for
   security
   reasons)... so it becomes the app's responsibility to do version
   control.
   The comparison function is not going to change by itself - someone
   has
   to go
   into the code and change it, when they do that they should up the
   revision
   of the database, if that change is incompatible.
 
  Why should we let such a pitfall exist if we can just store the
  function and avoid the issue?
 
  I don't see it as a pitfall; it has the advantage of transparency.
 
 
   There is exactly the same problem with object properties. If the app
   changes
   to expect a new property on all objects stored, then the app has to
   correctly deal with the update.
 
  If a requested property doesn't exist, I assume the API will fail
  immediately with a clear error code.  It will not fail silently and
  mysteriously with no error code.  (Again, I haven't looked at it
  closely, or tried to use it.)
 
  What if the new version uses the same property name for a different
  thing?
  For example in V1 'Employer' is a string name, and in V2 'Employer' is a
  reference to another object. You may say 'you should change the column
  name'? Right, that's just the same as me saying you should change the DB
  version number when you change the collation algorithm. It's the same
  thing.
  People seem to be making a big fuss about having a non-persisted
  collation
  function defined in user code, when many many things require the code to
  have the correct model of the data stored in the database to work
  properly.
  It seems illogical to make a special case for this function, and not do
  anything about all the other cases. IMHO either the database should have
  a
  stored schema, or it should not. If IndexedDB is going the direction of
  not
  having a stored schema, then the designers should have the confidence in
  their decision to stick with it and at least produce something with a
  consistent approach to the problem.
 
 
   2) making things easy for the user - for me a simpler more
   predictable
   API
   is better for the user. Having a function stored inside the database
   is
   bad,
   because you cannot see what function might be stored in there...
 
  We could let you query the stored function.
 
  Why would you need to read it? Every time you open the database you
  would
  need to check the function is the one you expect. The code would have to
  contain the function so it can compare it with the one in the DB and
  update
  it if necessary. If the code contains the function there are two copies
  of
  the function, one in the database and one in the code? which one is
  correct?
  which one is it using? So sometimes you will write the new function to
  the
  database, and sometimes you will not? More paths to test in code
  coverage,
  more complexity. It's simpler to just always set the function when
  opening
  the database.
 
 
   it might be
   a function from a previous version of the code and cause all sorts of
   strange bugs (which will only affect certain users with a certain
   version of
   the function stored in their DB).
 
  It will cause *much* less strange bugs than if you have one index that
  used two different collations, which is the alternative possibility.
  If the function is stored, the worst case will be that the collation
  function is out of date.  In practice, authors will mostly want to use
  established collation functions like UCA and won't mind if they're out
  of date.  They'll also only very rarely have occasion to deliberately
  change the function.
 
  As I said, you will end up querying the function to see if it is the one
  you want to use; if you do that, you may as well set it every time.
 

Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-06 Thread Keean Schupke
On 6 May 2011 10:18, Jonas Sicking jo...@sicking.cc wrote:

 On Thu, May 5, 2011 at 11:36 PM, Keean Schupke ke...@fry-it.com wrote:
  On 6 May 2011 03:00, Jonas Sicking jo...@sicking.cc wrote:
 
  On Wed, May 4, 2011 at 11:12 PM, Keean Schupke ke...@fry-it.com
 wrote:
   On 5 May 2011 00:33, Aryeh Gregor simetrical+...@gmail.com wrote:
  
   On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking jo...@sicking.cc
 wrote:
I don't think we should do callbacks for the first version of
javascript. It gets very messy since we can't rely on that the
 script
function will be returning stable values.
  
   The worst that would happen if it didn't return stable values is that
   sorting would return unpredictable results.
  
   Worst is an infinite loop - no return.
  
  
So the choice here really is between only supporting some form of
binary sorting, or supporting a built-in set of collations.
 Anything
else will have to wait for version 2 in my opinion.
  
   I think it would be a mistake to try supporting a limited set of
   natural-language collations.  Binary collation is fine for a first
   version.  MySQL only supported binary collation up through version 4,
   for instance.
  
   A good point about MySQL.
  
  
   On Wed, May 4, 2011 at 3:49 AM, Keean Schupke ke...@fry-it.com
 wrote:
I thought only the app that created the db could open it (for
security
reasons)... so it becomes the app's responsibility to do version
control.
The comparison function is not going to change by itself - someone
has
to go
into the code and change it, when they do that they should up the
revision
of the database, if that change is incompatible.
  
   Why should we let such a pitfall exist if we can just store the
   function and avoid the issue?
  
   I don't see it as a pitfall; it has the advantage of transparency.
  
  
There is exactly the same problem with object properties. If the
 app
changes
to expect a new property on all objects stored, then the app has to
correctly deal with the update.
  
   If a requested property doesn't exist, I assume the API will fail
   immediately with a clear error code.  It will not fail silently and
   mysteriously with no error code.  (Again, I haven't looked at it
   closely, or tried to use it.)
  
   What if the new version uses the same property name for a different
   thing?
   For example in V1 'Employer' is a string name, and in V2 'Employer' is
 a
   reference to another object. You may say 'you should change the column
   name'? Right, that's just the same as me saying you should change the DB
   version number when you change the collation algorithm. It's the same
   thing.
   People seem to be making a big fuss about having a non-persisted
   collation
   function defined in user code, when many many things require the code
 to
   have the correct model of the data stored in the database to work
   properly.
   It seems illogical to make a special case for this function, and not
 do
   anything about all the other cases. IMHO either the database should
 have
   a
   stored schema, or it should not. If IndexedDB is going the direction
 of
   not
   having a stored schema, then the designers should have the confidence
 in
   their decision to stick with it and at least produce something with a
   consistent approach to the problem.
  
  
2) making things easy for the user - for me a simpler more
predictable
API
is better for the user. Having a function stored inside the
 database
is
bad,
because you cannot see what function might be stored in there...
  
   We could let you query the stored function.
  
   Why would you need to read it? Every time you open the database you
   would
   need to check the function is the one you expect. The code would have
 to
   contain the function so it can compare it with the one in the DB and
   update
   it if necessary. If the code contains the function there are two
 copies
   of
   the function, one in the database and one in the code? which one is
   correct?
   which one is it using? So sometimes you will write the new function to
   the
   database, and sometimes you will not? More paths to test in code
   coverage,
   more complexity. It's simpler to just always set the function when
   opening
   the database.
  
  
it might be
a function from a previous version of the code and cause all sorts
 of
strange bugs (which will only affect certain users with a certain
version of
the function stored in their DB).
  
   It will cause *much* less strange bugs than if you have one index
 that
   used two different collations, which is the alternative possibility.
   If the function is stored, the worst case will be that the collation
   function is out of date.  In practice, authors will mostly want to
 use
   established collation functions like UCA and won't mind if they're
 out
   of date.  They'll also only very rarely 

Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-06 Thread Jonas Sicking
On Fri, May 6, 2011 at 4:09 AM, Keean Schupke ke...@fry-it.com wrote:
 On 6 May 2011 10:18, Jonas Sicking jo...@sicking.cc wrote:

 On Thu, May 5, 2011 at 11:36 PM, Keean Schupke ke...@fry-it.com wrote:
  On 6 May 2011 03:00, Jonas Sicking jo...@sicking.cc wrote:
 
  On Wed, May 4, 2011 at 11:12 PM, Keean Schupke ke...@fry-it.com
  wrote:
   On 5 May 2011 00:33, Aryeh Gregor simetrical+...@gmail.com wrote:
  
   On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking jo...@sicking.cc
   wrote:
I don't think we should do callbacks for the first version of
javascript. It gets very messy since we can't rely on that the
script
function will be returning stable values.
  
   The worst that would happen if it didn't return stable values is
   that
   sorting would return unpredictable results.
  
   Worst is an infinite loop - no return.
  
  
So the choice here really is between only supporting some form of
binary sorting, or supporting a built-in set of collations.
Anything
else will have to wait for version 2 in my opinion.
  
   I think it would be a mistake to try supporting a limited set of
   natural-language collations.  Binary collation is fine for a first
   version.  MySQL only supported binary collation up through version
   4,
   for instance.
  
   A good point about MySQL.
  
  
   On Wed, May 4, 2011 at 3:49 AM, Keean Schupke ke...@fry-it.com
   wrote:
I thought only the app that created the db could open it (for
security
reasons)... so it becomes the app's responsibility to do version
control.
The comparison function is not going to change by itself - someone
has
to go
into the code and change it, when they do that they should up the
revision
of the database, if that change is incompatible.
  
   Why should we let such a pitfall exist if we can just store the
   function and avoid the issue?
  
   I don't see it as a pitfall; it has the advantage of transparency.
  
  
There is exactly the same problem with object properties. If the
app
changes
to expect a new property on all objects stored, then the app has
to
correctly deal with the update.
  
   If a requested property doesn't exist, I assume the API will fail
   immediately with a clear error code.  It will not fail silently and
   mysteriously with no error code.  (Again, I haven't looked at it
   closely, or tried to use it.)
  
   What if the new version uses the same property name for a different
   thing?
   For example in V1 'Employer' is a string name, and in V2 'Employer'
   is a
   reference to another object. You may say 'you should change the
   column
   name'? Right, that's just the same as me saying you should change the DB
   version number when you change the collation algorithm. It's the same
   thing.
   People seem to be making a big fuss about having a non-persisted
   collation
   function defined in user code, when many many things require the code
   to
   have the correct model of the data stored in the database to work
   properly.
   It seems illogical to make a special case for this function, and not
   do
   anything about all the other cases. IMHO either the database should
   have
   a
   stored schema, or it should not. If IndexedDB is going the direction
   of
   not
   having a stored schema, then the designers should have the confidence
   in
   their decision to stick with it and at least produce something with a
   consistent approach to the problem.
  
  
2) making things easy for the user - for me a simpler more
predictable
API
is better for the user. Having a function stored inside the
database
is
bad,
because you cannot see what function might be stored in there...
  
   We could let you query the stored function.
  
   Why would you need to read it? Every time you open the database you
   would
   need to check the function is the one you expect. The code would have
   to
   contain the function so it can compare it with the one in the DB and
   update
   it if necessary. If the code contains the function there are two
   copies
   of
   the function, one in the database and one in the code? which one is
   correct?
   which one is it using? So sometimes you will write the new function
   to
   the
   database, and sometimes you will not? More paths to test in code
   coverage,
   more complexity. It's simpler to just always set the function when
   opening
   the database.
  
  
it might be
a function from a previous version of the code and cause all sorts
of
strange bugs (which will only affect certain users with a certain
version of
the function stored in their DB).
  
   It will cause *much* less strange bugs than if you have one index
   that
   used two different collations, which is the alternative possibility.
   If the function is stored, the worst case will be that the collation
   function is out of date.  In practice, authors will mostly 

Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-06 Thread timeless
On Fri, May 6, 2011 at 2:32 AM, Jonas Sicking jo...@sicking.cc wrote:
 I'm not worried about crashes or security issues, but I am worried
 about performance. Not only is it the overhead of crossing from C++
 into JS, but also the fact that the C++ code has to go through extra
 pains to ensure that the world around it still makes sense by the time
 you come back from the JS callback. For example the callback could
 have deleted all IndexedDB databases and navigated to a new page. So
 every time you get back from JS you have to spend a bunch of time
 rechecking all the state that you were holding in your function
 implementation.

I think that a stored procedure could be considered as a compiled
version of a serialized function. i.e. something which loses its scope
chain, and which loses access to its parent object. If it loses access
to its scope chain which includes the interesting globals, it will no
longer have access to fun things like DOM objects, roughly like
DOMWorkers but with even less exciting objects available. I'd hope
that a jit should be able to do a fairly reasonable job of optimizing
such a function given these constraints.

The resulting keys could be stored with the database, so you don't
have to recalculate them while sorting, only during insertion or if
the sort key function is changed.
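
A minimal sketch of the stored-sort-key idea above, assuming a hypothetical
in-memory index (createKeyedIndex and sortKeyFn are illustrative names, not
part of any IndexedDB draft). The key function runs once per insertion, and
ordering afterwards relies only on binary comparison of the stored keys:

```javascript
// Hypothetical in-memory index that stores a precomputed sort key per record.
// sortKeyFn stands in for the serialized, scope-free function discussed above;
// it is called once per insertion and never while comparing entries.
function createKeyedIndex(sortKeyFn) {
  const entries = []; // kept ordered by sortKey using plain string comparison
  return {
    insert(value) {
      const sortKey = sortKeyFn(value);
      // Binary search for the insertion point; no user code runs in this loop.
      let lo = 0;
      let hi = entries.length;
      while (lo < hi) {
        const mid = (lo + hi) >>> 1;
        if (entries[mid].sortKey <= sortKey) lo = mid + 1;
        else hi = mid;
      }
      entries.splice(lo, 0, { sortKey, value });
    },
    values() {
      return entries.map((entry) => entry.value);
    },
  };
}

// Example key function (an assumption for illustration): fold case so that
// ordering ignores the difference between "Apple" and "apple".
const caseFoldedIndex = createKeyedIndex((s) => s.toLowerCase());
["banana", "Apple", "cherry"].forEach((v) => caseFoldedIndex.insert(v));
```

Changing the key function in this sketch would mean recomputing every stored
key, which matches the "only during insertion or if the sort key function is
changed" cost described above.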

 All of this is totally doable. It's not even particularly hard. But it
 costs performance.



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-06 Thread Aryeh Gregor
On Thu, May 5, 2011 at 10:00 PM, Jonas Sicking jo...@sicking.cc wrote:
 We have already decided that we don't want to take on the complexity
 that comes with supporting changing collations on existing data. In
 particular it becomes very unclear what to do with data that is no
 longer unique under the new collation.

This is only an issue for unique indexes.  In MySQL, if you alter a
table such that a uniqueness constraint is violated, it will abort
with an error as soon as it detects the problem, not changing the
table.  But if you're using a non-binary collation function, you
rarely want a unique index anyway.
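
The abort-on-conflict behaviour described above can be sketched as follows.
This is only an illustration under stated assumptions: the collation is
modelled as a key function, and findCollationConflicts and
recollateUniqueIndex are hypothetical names, not MySQL or IndexedDB APIs.

```javascript
// Hypothetical check: which existing values collide under a new key function?
function findCollationConflicts(values, keyFn) {
  const seen = new Map(); // key -> first value that produced it
  const conflicts = [];
  for (const value of values) {
    const key = keyFn(value);
    if (seen.has(key)) conflicts.push([seen.get(key), value]);
    else seen.set(key, value);
  }
  return conflicts;
}

// Re-sort a unique index under a new collation, or throw without changing
// anything if uniqueness would be violated.
function recollateUniqueIndex(values, keyFn) {
  const conflicts = findCollationConflicts(values, keyFn);
  if (conflicts.length > 0) {
    throw new Error(
      "uniqueness violated under new collation: " + JSON.stringify(conflicts[0])
    );
  }
  return [...values].sort((a, b) => {
    const ka = keyFn(a);
    const kb = keyFn(b);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
}
```

With values such as "Smith" and "smith", a case-folding key function reports a
conflict and the input array is left untouched, while a binary (identity) key
function does not.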

Still, I don't think this is needed for a first implementation of
collations.  It's something to support at some future date.

 I think ultimately we simply seem to disagree here. I think that
 supporting a standard set of collations is going to solve more than
 80% of the use cases (which is a good rule of thumb for these things)
 for version 1, as well as being easier on users, and so is something we'll
 ultimately want to add anyway. Thus adding it now won't be
 painting us in a corner and it solves the majority of use cases.

 If I understand you correctly you don't think that it solves the
 majority of use cases and you think that it adds API which is bad and
 that we should never add.

 Is this a correct assessment?

For my part, I agree that supporting a high-quality, comprehensive,
standard set of collations, such as UCA with CLDR tailoring, is going
to solve much more than 80% of the use-cases.  However,

1) Versioning is a possible issue if we want full interop, since CLDR
changes often.  If browsers can't update the collation of existing
indexes, they'll be forced to either stick to one version of CLDR
forever, or carry around multiple CLDR version implementations to
account for both old and new indexes.  Moreover, if browsers do ever
update their CLDR version, we'll have different collations going by
the same name in different browsers.  One way to work around this is
to specify for a first pass that browsers must implement some specific
CLDR version, like the latest at the time the standard is published,
and then just not update it for some indefinite period.

2) If there's going to be collation support in any version, it should
be full-fledged UCA, not anything less.  Better to push off collation
support entirely to a future version than to have some simplified or
undefined collation support that will have to be maintained forever.
So if possible, support for all CLDR locales would be great; failing
that, support for just untailored UCA; failing that, binary collation
only.  Much better to allow binary collation only than to not define
the collation behavior.

3) Allowing users to specify a collation function is not needed in a
first or second draft, but could be a useful feature for the future,
so it would be worthwhile to at least keep that in mind when defining
the API.  As long as the API could be later extended to support custom
functions without too much trouble, that should be enough for now IMO.
 I'm sure there are more important things to worry about.

(Custom collation functions can be useful for things other than
natural language.  For instance,
http://en.wikipedia.org/wiki/Special:LinkSearch lets you search
external links on Wikipedia by prefix.  It supports searching for
things like *wikipedia.org, which will actually match a domain of
^.*wikipedia.org$ with any path.  This works by having an extra field
in the externallinks table containing the URL with domain names
reversed, like http://org.wikipedia.en./wiki/ instead of
http://en.wikipedia.org/wiki/, and this extra field is then indexed.
This is a waste of space, since we store the URLs twice.  In
PostgreSQL we could instead define an index based on a function
without having to create an extra column.  But as this example
illustrates, it's not essential functionality -- you can always add a
redundant column.)
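
As a rough sketch of the reversed-domain key described above (reversedHostKey
is a hypothetical helper name, and the regular expression is a simplification
that ignores ports and userinfo), the derived key could be computed like this:

```javascript
// Hypothetical helper: derive an index key with the host labels reversed, so
// that a prefix search on the key matches every subdomain of a given domain.
function reversedHostKey(url) {
  const match = /^([a-z][a-z0-9+.-]*:\/\/)([^\/]*)(.*)$/i.exec(url);
  if (!match) return url; // leave input we cannot parse unchanged
  const [, scheme, host, path] = match;
  // "en.wikipedia.org" becomes "org.wikipedia.en." (note the trailing dot).
  return scheme + host.split(".").reverse().join(".") + "." + path;
}

const exampleKey = reversedHostKey("http://en.wikipedia.org/wiki/");
```

For the URL in the example above this yields http://org.wikipedia.en./wiki/,
so a search for *wikipedia.org becomes a prefix match on
http://org.wikipedia.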

On Fri, May 6, 2011 at 5:18 AM, Jonas Sicking jo...@sicking.cc wrote:
 Based on that, my conclusion is that we should go with what Pablo is
 proposing. And I think we should do it for v1.

If I understand correctly, Pablo's proposal is that the author be
allowed to specify a locale, and the browser can collate in some
undefined way based on that locale.  That sounds like a really bad
idea for interop.  If non-binary collation is supported in a first
version, it should be either

1) Two choices, binary or UCA 6.0.0.  (AFAIK, UCA gives fairly good
results for most languages even without tailoring, so it might be just
fine for v1.  It's vastly better than binary, for sure.)

2) In addition to binary and UCA 6.0.0, allow UCA 6.0.0 tailored by
any of the locales defined by CLDR 1.9.1.
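
To make the difference between the binary choice and a language-aware choice
concrete, here is a small sketch. It uses Intl.Collator, which was
standardized after this thread and is only a stand-in here: the UCA and CLDR
versions behind it are whatever the engine ships, not the UCA 6.0.0 or
CLDR 1.9.1 versions named above.

```javascript
const words = ["zebra", "Zulu", "apple", "Apple"];

// Binary collation: the default sort compares UTF-16 code units, so every
// uppercase ASCII letter sorts before every lowercase one.
const binaryOrder = [...words].sort();

// Language-aware collation (assumption: the engine's built-in data for "en").
// Case is only a minor difference, so words are grouped by letter first.
const collator = new Intl.Collator("en");
const collatedOrder = [...words].sort(collator.compare);
```

The binary order here is Apple, Zulu, apple, zebra. The collated order keeps
the two spellings of "apple" together ahead of "zebra" and "Zulu", and any
finer detail of it depends on the collation data version, which is the interop
concern raised in this message.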

There also needs to be some thought put into how to handle version
updates, since browsers cannot update their UCA or CLDR implementation
without rebuilding all existing indexes that used it (unless they keep
the old implementation 

Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-05 Thread Aryeh Gregor
On Thu, May 5, 2011 at 2:12 AM, Keean Schupke ke...@fry-it.com wrote:
 What if the new version uses the same property name for a different thing?

Yes, obviously it's going to be possible for code changes to cause
hard-to-catch bugs due to not updating the database correctly.  We
don't have to add more cases where that's possible than necessary,
without good reason.  Maybe there's good reason here, but the added
potential for error can't be neglected as a cost.

 Why would you need to read it? Every time you open the database you would
 need to check the function is the one you expect.

Not if you never intend to change it, or don't care if it's outdated.
I expect this to be the most common case.

Consider the case of someone using CLDR-tailored UCA and a new version
comes out.  You want to use the newest version for new indexes, if
multiple versions are available, but there's no pressing need to
automatically update existing indexes.  The old version is almost
certainly good enough, unless your users use obscure languages.  So in
my scheme, you can just update the function in your code and do
nothing else.  In your scheme, you'd have to either stick to the old
version across the board, or include both versions in your code
indefinitely and include out-of-band logic to choose between them, or
write a script that rebuilds the whole index on update (which would
take a long time for a large index).

 The code would have to
 contain the function so it can compare it with the one in the DB and update
 it if necessary. If the code contains the function there are two copies of
 the function, one in the database and one in the code? which one is correct?
 which one is it using? So sometimes you will write the new function to the
 database, and sometimes you will not? More paths to test in code coverage,
 more complexity. It's simpler to just always set the function when opening
 the database.

If the collation function is stored in the database, then I'd expect
setting the function to rebuild the index if the new and old functions
differ.  This could happen as a background operation, with the
existing index still usable (with the old collation function) in the
meantime.  So if you always wanted collations up-to-date, in my scheme
authors could just set the function every time they open the database,
as with your scheme.  But this could trigger a silent rebuild whenever
necessary, so the author doesn't have to worry about it.  In your
scheme, the author has to do the rebuild himself, and if he gets it
wrong, the index will be corrupted.
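
A minimal sketch of the stored-function scheme described above, assuming an
in-memory index and a synchronous rebuild (createStoredCollationIndex is a
hypothetical name). Comparing functions by source text is also an assumption
made for illustration; it cannot tell apart two closures that share source
but capture different variables.

```javascript
// Hypothetical index that keeps the collation function with the data and
// rebuilds itself only when a different function is supplied.
function createStoredCollationIndex(compare) {
  let storedSource = compare.toString(); // "persisted" with the index here
  let storedCompare = compare;
  const keys = [];
  let rebuilds = 0;
  return {
    add(key) {
      keys.push(key);
      keys.sort(storedCompare); // always ordered by the stored function
    },
    setCollation(newCompare) {
      if (newCompare.toString() === storedSource) return false; // unchanged
      storedSource = newCompare.toString();
      storedCompare = newCompare;
      keys.sort(storedCompare); // rebuild; an engine could do this in the background
      rebuilds += 1;
      return true;
    },
    keys() {
      return [...keys];
    },
    rebuildCount() {
      return rebuilds;
    },
  };
}

const binaryCompare = (x, y) => (x < y ? -1 : x > y ? 1 : 0);
const caseFoldedCompare = (x, y) => {
  const a = x.toLowerCase();
  const b = y.toLowerCase();
  return a < b ? -1 : a > b ? 1 : 0;
};

const storedIndex = createStoredCollationIndex(binaryCompare);
["c", "a", "B"].forEach((k) => storedIndex.add(k));
const beforeChange = storedIndex.keys();
const sameFunctionRebuilt = storedIndex.setCollation(binaryCompare);
const newFunctionRebuilt = storedIndex.setCollation(caseFoldedCompare);
const afterChange = storedIndex.keys();
```

Setting the same function on every open costs nothing in this sketch, and the
index is never ordered by one function while being searched with another.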

So as I see it, my approach is easier to use across the board.  It
lets you not update collations on old tables without requiring you to
keep track of multiple collation function versions, and it also
potentially lets you update collations on old tables to the latest
versions with rebuilding done for you in the background.  Critically,
it does not let you change a sort function without rebuilding, since
that will always cause bugs and you never want to do it (to a first
approximation).

Of course, maybe an initial implementation wouldn't do rebuilds for
you, to keep it simple.  Then the collation function would be
immutable after index creation, so you'd still have to do rebuilds
yourself.  But it would still be easier and safer: the old index will
still work in the interim even if you don't have the old version of
your collation function around, and you can't mess up and get a
corrupted index.

 Thinking about this a bit more. If you change the collation function you
 need to re-sort the index to make sure it will work (and avoid those strange
 bugs). Storing the function in the DB enables you to compare the function
 and only change it when you need to, thus optimising the number of re-sorts.
 That is the _only_ advantage to storing the function - as you still need to
 check the function stored is the one you expect to guarantee your code will
 run properly. So with a non-persisted function we need to sort every time we
 open to make sure the order is correct.

And this is totally impractical for even moderately large datasets.  I
assume we want this to be usable for databases of, say, a gigabyte in
size.  You're not going to read, sort, and write a gigabyte on every
database open.

(My experience tends more toward multi-gigabyte databases or bigger,
including writing code for Wikipedia, which is multi-terabyte.  So
maybe I'm biased to think about scalability more than necessary for
IndexedDB, but re-sorting the index on every open still sounds really
impractical to me.)

 However, if we attach a version
 number to the index, we can check the version number in our code to know if
 we need to resort the index. The simplest API for this would be:
 index.setCollation(1.1, my_collation_function);
 So the version number is checked against the index. If it is the same, the
 supplied collation function is used without re-sorting the index. If it is
 different the index order is 

Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-05 Thread Jonas Sicking
On Wed, May 4, 2011 at 1:24 PM, Keean Schupke ke...@fry-it.com wrote:


 On 4 May 2011 21:01, Jonas Sicking jo...@sicking.cc wrote:

 On Wed, May 4, 2011 at 1:10 AM, Keean Schupke ke...@fry-it.com wrote:
  On 4 May 2011 00:57, Jonas Sicking jo...@sicking.cc wrote:
 
  On Tue, May 3, 2011 at 12:19 AM, Keean Schupke ke...@fry-it.com
  wrote:
   The more I think about it, the more I want a user-specified
   comparison
   function. Efficiency should not be an issue here - the engines should
   tweak
   the JIT compiler to fix any efficiency issues. Just let the user pass
   a
   closure (remember functions are first-class in JavaScript so this is
   not
   a
   callback nor an event).
 
  I don't think we should do callbacks for the first version of
  javascript. It gets very messy since we can't rely on that the script
  function will be returning stable values.
 
  garbage in = garbage out. The programmer's job is to write a correct
  comparison function. All functions have this problem. By this argument
  we
  had all better give up programming because there is a risk we may write
  a
  function that returns incorrect results.

 Browsers can certainly deal with this, and ensure that the only one
 suffering is the author of the buggy algorithm. However this comes at
 a cost in that the browser sorting algorithm can't go into infinite
 loops or crash even in the face of the most ridiculous comparison
 algorithm. In other words, the browser will likely have to use a
 slower sorting implementation in order to be robust.

 Additionally, there is a significant cost involved in transitioning
 between the C++ code implementing the sorting algorithm, and the
 javascript implemented callback. That is on top of the cost of
 implementing the comparison function in javascript. Even in the best
 JITs, there is a significant overhead to both these parts.

 So rather than repeating myself, I'll just quote myself:

  So the choice here really is between only supporting some form of
  binary sorting, or supporting a built-in set of collations. Anything
  else will have to wait for version 2 in my opinion.

 :)

 / Jonas

 I gave my answer, and some follow up questions in a previous email, so I am
 not avoiding the question. My point was any event handler (onMouseDown?)
 could have an infinite loop - why so fussy about this one function when so
 many others have the same problem?
 The performance point of calling to JavaScript is a valid one, but is this a
 problem? Perhaps it is fast enough. I have seen no evidence that it will be
 too slow for people to use - perhaps the bottleneck will be the disk/flash
 access speed for fetching the blocks and not the JavaScript comparison
 function.

I'm not worried about crashes or security issues, but I am worried
about performance. Not only is it the overhead of crossing from C++
into JS, but also the fact that the C++ code has to go through extra
pains to ensure that the world around it still makes sense by the time
you come back from the JS callback. For example the callback could
have deleted all IndexedDB databases and navigated to a new page. So
every time you get back from JS you have to spend a bunch of time
rechecking all the state that you were holding in your function
implementation.

All of this is totally doable. It's not even particularly hard. But it
costs performance.
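
To illustrate the kind of rechecking meant here, a hedged sketch in script
rather than C++: the store object, its generation counter and its closed flag
are all hypothetical stand-ins for engine state, not real IndexedDB
internals.

```javascript
// Hypothetical sorted insert that calls a user-supplied comparison function
// and re-validates the store's state after every call into user code.
function insertSorted(store, key, compare) {
  const generation = store.generation;
  let lo = 0;
  let hi = store.keys.length;
  // The range halves on every pass, so the loop ends no matter what the
  // comparison function returns.
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    const order = compare(store.keys[mid], key); // arbitrary user code runs here
    if (store.closed || store.generation !== generation) {
      throw new Error("store changed during comparison callback");
    }
    if (order <= 0) lo = mid + 1;
    else hi = mid;
  }
  store.keys.splice(lo, 0, key);
  store.generation += 1;
}

const demoStore = { keys: ["a", "c"], generation: 0, closed: false };
insertSorted(demoStore, "b", (x, y) => (x < y ? -1 : x > y ? 1 : 0));
```

The extra check after each comparison is cheap in this toy version; the point
made above is that a real engine holds far more state that would need the
same treatment on every return from script.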

/ Jonas



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-05 Thread Jonas Sicking
On Wed, May 4, 2011 at 11:12 PM, Keean Schupke ke...@fry-it.com wrote:
 On 5 May 2011 00:33, Aryeh Gregor simetrical+...@gmail.com wrote:

 On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking jo...@sicking.cc wrote:
  I don't think we should do callbacks for the first version of
  javascript. It gets very messy since we can't rely on that the script
  function will be returning stable values.

 The worst that would happen if it didn't return stable values is that
 sorting would return unpredictable results.

 Worst is an infinite loop - no return.


  So the choice here really is between only supporting some form of
  binary sorting, or supporting a built-in set of collations. Anything
  else will have to wait for version 2 in my opinion.

 I think it would be a mistake to try supporting a limited set of
 natural-language collations.  Binary collation is fine for a first
 version.  MySQL only supported binary collation up through version 4,
 for instance.

 A good point about MySQL.


 On Wed, May 4, 2011 at 3:49 AM, Keean Schupke ke...@fry-it.com wrote:
  I thought only the app that created the db could open it (for security
  reasons)... so it becomes the app's responsibility to do version
  control.
  The comparison function is not going to change by itself - someone has
  to go
  into the code and change it, when they do that they should up the
  revision
  of the database, if that change is incompatible.

 Why should we let such a pitfall exist if we can just store the
 function and avoid the issue?

 I don't see it as a pitfall; it has the advantage of transparency.


  There is exactly the same problem with object properties. If the app
  changes
  to expect a new property on all objects stored, then the app has to
  correctly deal with the update.

 If a requested property doesn't exist, I assume the API will fail
 immediately with a clear error code.  It will not fail silently and
 mysteriously with no error code.  (Again, I haven't looked at it
 closely, or tried to use it.)

 What if the new version uses the same property name for a different thing?
 For example in V1 'Employer' is a string name, and in V2 'Employer' is a
 reference to another object. You may say 'you should change the column
 name'? Right, that's just the same as me saying you should change the DB
 version number when you change the collation algorithm. It's the same thing.
 People seem to be making a big fuss about having a non-persisted collation
 function defined in user code, when many many things require the code to
 have the correct model of the data stored in the database to work properly.
 It seems illogical to make a special case for this function, and not do
 anything about all the other cases. IMHO either the database should have a
 stored schema, or it should not. If IndexedDB is going the direction of not
 having a stored schema, then the designers should have the confidence in
 their decision to stick with it and at least produce something with a
 consistent approach to the problem.


  2) making things easy for the user - for me a simpler more predictable
  API
  is better for the user. Having a function stored inside the database is
  bad,
  because you cannot see what function might be stored in there...

 We could let you query the stored function.

 Why would you need to read it? Every time you open the database you would
 need to check the function is the one you expect. The code would have to
 contain the function so it can compare it with the one in the DB and update
 it if necessary. If the code contains the function there are two copies of
 the function, one in the database and one in the code? which one is correct?
 which one is it using? So sometimes you will write the new function to the
 database, and sometimes you will not? More paths to test in code coverage,
 more complexity. It's simpler to just always set the function when opening
 the database.


  it might be
  a function from a previous version of the code and cause all sorts of
  strange bugs (which will only affect certain users with a certain
  version of
  the function stored in their DB).

 It will cause *much* less strange bugs than if you have one index that
 used two different collations, which is the alternative possibility.
 If the function is stored, the worst case will be that the collation
 function is out of date.  In practice, authors will mostly want to use
 established collation functions like UCA and won't mind if they're out
 of date.  They'll also only very rarely have occasion to deliberately
 change the function.

 As I said, you will end up querying the function to see if it is the one you
 want to use, if you do that you may as well set it every time.
 Thinking about this a bit more. If you change the collation function you
 need to re-sort the index to make sure it will work (and avoid those strange
 bugs). Storing the function in the DB enables you to compare the function
 and only change it when you need to, thus 

Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-04 Thread Keean Schupke
On 3 May 2011 23:59, Aryeh Gregor simetrical+...@gmail.com wrote:

 On Tue, May 3, 2011 at 10:56 AM, Keean Schupke ke...@fry-it.com wrote:
  Why does it need to be persisted? I would prefer the database to be
  stateless. Obviously all users of the database need to use the same
  function.

 And if they don't use exactly the same function, maybe due to a
 transient bug, the index is silently and permanently corrupted, until
 all affected rows happen to be updated again?  That doesn't sound like
 a good idea to me.


I thought only the app that created the db could open it (for security
reasons)... so it becomes the app's responsibility to do version control.
The comparison function is not going to change by itself - someone has to go
into the code and change it, when they do that they should up the revision
of the database, if that change is incompatible.

There is exactly the same problem with object properties. If the app changes
to expect a new property on all objects stored, then the app has to
correctly deal with the update.

There are two issues here:

1) doing things correctly - there is no problem here, providing the closure
works.

2) making things easy for the user - for me a simpler more predictable API
is better for the user. Having a function stored inside the database is bad,
because you cannot see what function might be stored in there... it might be
a function from a previous version of the code and cause all sorts of
strange bugs (which will only affect certain users with a certain version of
the function stored in their DB). By having the sort function in plain sight
in the source code it is visible and readable. Yes, there is a risk that the
code is changed and the order method is different from that in the DB, which
will cause breakage, but so can a function hidden in the database. Of the
two I would always choose to have everything clearly visible in the source
code where you can check it.


Cheers,
Keean.


Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-04 Thread Keean Schupke
On 4 May 2011 00:57, Jonas Sicking jo...@sicking.cc wrote:

 On Tue, May 3, 2011 at 12:19 AM, Keean Schupke ke...@fry-it.com wrote:
  The more I think about it, the more I want a user-specified comparison
  function. Efficiency should not be an issue here - the engines should
  tweak the JIT compiler to fix any efficiency issues. Just let the user
  pass a closure (remember functions are first-class in JavaScript so this
  is not a callback nor an event).

 I don't think we should do callbacks for the first version of
 javascript. It gets very messy since we can't rely on that the script
 function will be returning stable values.

 Additionally we'd either have to ask that the callback function is
 re-registered each time the database is opened, or somehow store a
 serialized copy of the callback function in the browser so that it's
 available the next time the database is opened. Neither of these
 things have been done in other APIs in the past, so if we hold up v1
 until we solve the challenges involved I think it will delay the
 release of a stable spec.

 So the choice here really is between only supporting some form of
 binary sorting, or supporting a built-in set of collations. Anything
 else will have to wait for version 2 in my opinion.

 / Jonas


That's fine with me, provided the other issues around collation orders are
solved. If something like the Unicode algorithm is used (and if not, I would
want to be convinced there is a good reason for doing something different
from everyone else), there is the issue of which orderings are provided by
everyone (maybe DUCET + current CLDR). Then there is the question of how
often the CLDR data should be updated. Should there be a live fetch / version
check every time the DB is started (which seems like a sensible route to me,
where possible)? Otherwise, the CLDR version could be specified by the
standard and updated with each version of the standard.


Cheers,
Keean.


Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-04 Thread Keean Schupke
On 4 May 2011 00:57, Jonas Sicking jo...@sicking.cc wrote:

 On Tue, May 3, 2011 at 12:19 AM, Keean Schupke ke...@fry-it.com wrote:
  The more I think about it, the more I want a user-specified comparison
  function. Efficiency should not be an issue here - the engines should
  tweak the JIT compiler to fix any efficiency issues. Just let the user
  pass a closure (remember functions are first-class in JavaScript so this
  is not a callback nor an event).

 I don't think we should do callbacks for the first version of
 javascript. It gets very messy since we can't rely on that the script
 function will be returning stable values.



garbage in = garbage out. The programmer's job is to write a correct
comparison function. All functions have this problem. By this argument we
had all better give up programming because there is a risk we may write a
function that returns incorrect results.



 Additionally we'd either have to ask that the callback function is
 re-registered each time the database is opened, or somehow store a



I still think re-registering is a non-issue. It is trivial to declare a
local open function openNameIndex that calls openIndex with the correct
callback and provide that as a software module - either in the main code, or
in a separate JS file that can be included in each page. Modular programming
is a good thing, should be encouraged, and is the traditional software
engineering solution to this kind of problem.
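
A minimal sketch of the wrapper module described above. It assumes a
hypothetical openIndex(name, compare) call that accepts a comparison
function; no such call exists in IndexedDB, and stubOpenIndex below is only a
stand-in so that the sketch runs on its own.

```javascript
// Hypothetical: `openIndex(name, compare)` stands in for the proposed API
// that would accept a comparison function. It is not a real IndexedDB call.
function makeIndexOpeners(openIndex) {
  // Case-insensitive comparison, defined in exactly one place.
  function compareNames(a, b) {
    const x = a.toLowerCase();
    const y = b.toLowerCase();
    return x < y ? -1 : x > y ? 1 : 0;
  }
  return {
    // Every page calls this instead of passing its own comparator.
    openNameIndex() {
      return openIndex("name", compareNames);
    },
  };
}

// Stub standing in for the database, only to make the sketch runnable.
function stubOpenIndex(name, compare) {
  return { name, sorted: (values) => [...values].sort(compare) };
}

const db = makeIndexOpeners(stubOpenIndex);
const nameIndex = db.openNameIndex();
console.log(nameIndex.sorted(["bob", "Alice", "carol"]));
```

Because the comparator lives only inside the module, pages cannot
accidentally open the same index with a different function.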


serialized copy of the callback function in the browser so that it's
 available the next time the database is opened. Neither of these
 things have been done in other APIs in the past, so if we hold up v1
 until we solve the challenges involved I think it will delay the
 release of a stable spec.

 So the choice here really is between only supporting some form of
 binary sorting, or supporting a built-in set of collations. Anything
 else will have to wait for version 2 in my opinion.

 / Jonas



Cheers,
Keean.


Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-04 Thread Jonas Sicking
On Wed, May 4, 2011 at 1:10 AM, Keean Schupke ke...@fry-it.com wrote:
 On 4 May 2011 00:57, Jonas Sicking jo...@sicking.cc wrote:

 On Tue, May 3, 2011 at 12:19 AM, Keean Schupke ke...@fry-it.com wrote:
  The more I think about it, the more I want a user-specified comparison
  function. Efficiency should not be an issue here - the engines should
  tweak the JIT compiler to fix any efficiency issues. Just let the user
  pass a closure (remember functions are first-class in JavaScript so this
  is not a callback nor an event).

 I don't think we should do callbacks for the first version of
 javascript. It gets very messy since we can't rely on that the script
 function will be returning stable values.

 garbage in = garbage out. The programmer's job is to write a correct
 comparison function. All functions have this problem. By this argument we
 had all better give up programming because there is a risk we may write a
 function that returns incorrect results.

Browsers can certainly deal with this, and ensure that the only one
suffering is the author of the buggy algorithm. However this comes at
a cost in that the browser sorting algorithm can't go into infinite
loops or crash even in the face of the most ridiculous comparison
algorithm. In other words, the browser will likely have to use a
slower sorting implementation in order to be robust.

Additionally, there is a significant cost involved in transitioning
between the C++ code implementing the sorting algorithm, and the
javascript implemented callback. That is on top of the cost of
implementing the comparison function in javascript. Even in the best
JITs, there is a significant overhead to both these parts.

So rather than repeating myself, I'll just quote myself:

 So the choice here really is between only supporting some form of
 binary sorting, or supporting a built-in set of collations. Anything
 else will have to wait for version 2 in my opinion.

:)

/ Jonas



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-04 Thread Keean Schupke
On 4 May 2011 21:01, Jonas Sicking jo...@sicking.cc wrote:

 On Wed, May 4, 2011 at 1:10 AM, Keean Schupke ke...@fry-it.com wrote:
  On 4 May 2011 00:57, Jonas Sicking jo...@sicking.cc wrote:
 
  On Tue, May 3, 2011 at 12:19 AM, Keean Schupke ke...@fry-it.com wrote:
   The more I think about it, the more I want a user-specified comparison
   function. Efficiency should not be an issue here - the engines should
   tweak the JIT compiler to fix any efficiency issues. Just let the user
   pass a closure (remember functions are first-class in JavaScript so this
   is not a callback nor an event).
 
  I don't think we should do callbacks for the first version of
  javascript. It gets very messy since we can't rely on that the script
  function will be returning stable values.
 
  garbage in = garbage out. The programmer's job is to write a correct
  comparison function. All functions have this problem. By this argument we
  had all better give up programming because there is a risk we may write a
  function that returns incorrect results.

 Browsers can certainly deal with this, and ensure that the only one
 suffering is the author of the buggy algorithm. However this comes at
 a cost in that the browser sorting algorithm can't go into infinite
 loops or crash even in the face of the most ridiculous comparison
 algorithm. In other words, the browser will likely have to use a
 slower sorting implementation in order to be robust.

 Additionally, there is a significant cost involved in transitioning
 between the C++ code implementing the sorting algorithm, and the
 javascript implemented callback. That is on top of the cost of
 implementing the comparison function in javascript. Even in the best
 JITs, there is a significant overhead to both these parts.

 So rather than repeating myself, I'll just quote myself:

  So the choice here really is between only supporting some form of
  binary sorting, or supporting a built-in set of collations. Anything
  else will have to wait for version 2 in my opinion.

 :)

 / Jonas


I gave my answer, and some follow-up questions, in a previous email, so I am
not avoiding the question. My point was that any event handler (onMouseDown?)
could have an infinite loop - why be so fussy about this one function when so
many others have the same problem?

The performance point about calling into JavaScript is a valid one, but is
this a problem? Perhaps it is fast enough. I have seen no evidence that it
will be too slow for people to use - perhaps the bottleneck will be the
disk/flash access speed for fetching the blocks and not the JavaScript
comparison function.


Cheers,
Keean.


Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-04 Thread Aryeh Gregor
On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking jo...@sicking.cc wrote:
 I don't think we should do callbacks for the first version of
 javascript. It gets very messy since we can't rely on that the script
 function will be returning stable values.

The worst that would happen if it didn't return stable values is that
sorting would return unpredictable results.
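
As a rough illustration of that claim, here is what an inconsistent
comparator does to the built-in sort(). This is only a sketch: the
ECMAScript specification leaves the resulting order implementation-defined,
and the comparator below is deliberately broken.

```javascript
// A deliberately inconsistent comparator: its answer is random, so it
// violates the ordering contract that sort() expects.
function unstableCompare() {
  return Math.random() < 0.5 ? -1 : 1;
}

const input = ["a", "b", "c", "d", "e", "f", "g", "h"];
const shuffled = [...input].sort(unstableCompare);

// The order is unpredictable, but nothing is lost or duplicated.
const sameElements =
  shuffled.length === input.length &&
  [...shuffled].sort().join("") === input.join("");
console.log(shuffled.join(""), sameElements);
```

The sort finishes and returns a permutation of the input; only the order is
garbage. A persistent index is a harsher setting, since the bad order would
be stored rather than recomputed.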

 So the choice here really is between only supporting some form of
 binary sorting, or supporting a built-in set of collations. Anything
 else will have to wait for version 2 in my opinion.

I think it would be a mistake to try supporting a limited set of
natural-language collations.  Binary collation is fine for a first
version.  MySQL only supported binary collation up through version 4,
for instance.

On Wed, May 4, 2011 at 3:49 AM, Keean Schupke ke...@fry-it.com wrote:
 I thought only the app that created the db could open it (for security
 reasons)... so it becomes the app's responsibility to do version control.
 The comparison function is not going to change by itself - someone has to go
 into the code and change it, when they do that they should up the revision
 of the database, if that change is incompatible.

Why should we let such a pitfall exist if we can just store the
function and avoid the issue?

 There is exactly the same problem with object properties. If the app changes
 to expect a new property on all objects stored, then the app has to
 correctly deal with the update.

If a requested property doesn't exist, I assume the API will fail
immediately with a clear error code.  It will not fail silently and
mysteriously with no error code.  (Again, I haven't looked at it
closely, or tried to use it.)

 2) making things easy for the user - for me a simpler more predictable API
 is better for the user. Having a function stored inside the database is bad,
 because you cannot see what function might be stored in there...

We could let you query the stored function.

 it might be
 a function from a previous version of the code and cause all sorts of
 strange bugs (which will only affect certain users with a certain version of
 the function stored in their DB).

It will cause *far* fewer strange bugs than if you have one index that
used two different collations, which is the alternative possibility.
If the function is stored, the worst case will be that the collation
function is out of date.  In practice, authors will mostly want to use
established collation functions like UCA and won't mind if they're out
of date.  They'll also only very rarely have occasion to deliberately
change the function.

On Wed, May 4, 2011 at 4:01 PM, Jonas Sicking jo...@sicking.cc wrote:
 Browsers can certainly deal with this, and ensure that the only one
 suffering is the author of the buggy algorithm. However this comes at
 a cost in that the browser sorting algorithm can't go into infinite
 loops or crash even in the face of the most ridiculous comparison
 algorithm. In other words, the browser will likely have to use a
 slower sorting implementation in order to be robust.

The browser will only run the function once every time the given field
changes, and change the value used in the index if it's different from
the current one.  The actual sorting will still be binary, just with a
user-provided key.  So there's no possibility of especially bad
effects if you're given a bad function.  You're only running it once
per value, so it's no worse than any other function that's run a bunch
of times.

We aren't talking about a sort()-style comparison function that
returns -1 or 0 or 1.  We're talking about a function that takes a
string as input, and outputs a string to be used in the index as the
key for the object in question.  I guess you *could* also do it as a
comparison function too -- would probably be easier to write, but also
a lot easier to get badly wrong, and you'd have to do a bunch of
function calls on insert or update instead of just one.
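
A small sketch of the difference in call counts. The function names
(sortKey, compare) and the lowercase-based key are illustrative assumptions,
not anything proposed for the spec.

```javascript
let keyCalls = 0;
let compareCalls = 0;

// Hypothetical sortkey function: string in, string out.
function sortKey(value) {
  keyCalls += 1;
  return value.toLowerCase();
}

// sort()-style comparator returning -1, 0 or 1.
function compare(a, b) {
  compareCalls += 1;
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  return x < y ? -1 : x > y ? 1 : 0;
}

const names = ["delta", "Alpha", "charlie", "Bravo", "echo", "Foxtrot"];

// Key approach: one call per value, then plain binary comparison of keys.
const keyed = names.map((name) => ({ key: sortKey(name), name }));
keyed.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
const byKey = keyed.map((entry) => entry.name);

// Comparator approach: user code runs on every comparison.
const byCompare = [...names].sort(compare);

console.log(byKey, keyCalls, compareCalls);
```

Both approaches produce the same order here, but the key function runs
exactly once per value while the comparator runs once per comparison.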

 Additionally, there is a significant cost involved in transitioning
 between the C++ code implementing the sorting algorithm, and the
 javascript implemented callback. That is on top of the cost of
 implementing the comparison function in javascript. Even in the best
 JITs, there is a significant overhead to both these parts.

It would only have to be run once per row (object?) modified.  Not run
at all for reads.  Would that really be so bad?  Also, most authors
would be content with built-in CLDR-based sort functions, which could
be C++.



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-03 Thread Keean Schupke
The more I think about it, the more I want a user-specified comparison
function. Efficiency should not be an issue here - the engines should tweak
the JIT compiler to fix any efficiency issues. Just let the user pass a
closure (remember functions are first-class in JavaScript so this is not a
callback nor an event).


Keean.


On 2 May 2011 19:57, Aryeh Gregor simetrical+...@gmail.com wrote:

 On Fri, Apr 29, 2011 at 3:19 PM, Keean Schupke ke...@fry-it.com wrote:
  As long as we have a binary mode I am happy.

 Something I didn't think to mention: what exactly is binary mode for
 DOMStrings?  I guess it means you encode as big-endian UTF-16, then
 sort bytewise?  This is kind of evil, but it matches what sort() does,
 so I guess it should be the required behavior.  (It's kind of evil
 because it doesn't match code-point order, unlike if you encoded as
 UTF-8.  E.g., U+10000 is encoded as 0xd800dc00 and U+E000 is 0xe000,
 so U+E000 sorts after U+10000.)

 Perhaps this should be spelled out more clearly in the spec.



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-03 Thread Aryeh Gregor
On Tue, May 3, 2011 at 3:19 AM, Keean Schupke ke...@fry-it.com wrote:
 The more I think about it, the more I want a user-specified comparison
 function. Efficiency should not be an issue here - the engines should tweak
 the JIT compiler to fix any efficiency issues. Just let the user pass a
 closure (remember functions are first-class in JavaScript so this is not a
 callback nor an event).

Wouldn't it be a bit more complicated than just passing a regular
closure?  The function has to be persisted in the database across page
views, but a JavaScript closure is going to contain references to all
sorts of objects (like document, or local variables) that are very
specific to the current page view.  It makes no sense to persist those
objects in general.  You'd need to serialize the function somehow,
possibly putting restrictions on the sorts of variables it can access,
so that it can be sensibly restored later.  Is there some established
way of doing this yet in JavaScript?  It might be useful in other
contexts too.

I still agree that this is the correct direction to go in, though.



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-03 Thread Keean Schupke
Why does it need to be persisted? I would prefer the database to be
stateless. Obviously all users of the database need to use the same
function. I would recommend modular programming: create a .js script you
can include in all pages that provides 'collated' versions of the method
calls by adding the collation argument. In fact, for good programming in
general, make this API your model: if you were writing a shopping cart,
this '.js' would provide methods like 'addToCart' and 'removeFromCart', and
all collation settings would be hidden in this layer and kept out of
individual pages, without needing to be stored in the database at all.

Cheers,
Keean.


On 3 May 2011 15:27, Aryeh Gregor simetrical+...@gmail.com wrote:

 On Tue, May 3, 2011 at 3:19 AM, Keean Schupke ke...@fry-it.com wrote:
  The more I think about it, the more I want a user-specified comparison
  function. Efficiency should not be an issue here - the engines should
  tweak the JIT compiler to fix any efficiency issues. Just let the user
  pass a closure (remember functions are first-class in JavaScript so this
  is not a callback nor an event).

 Wouldn't it be a bit more complicated than just passing a regular
 closure?  The function has to be persisted in the database across page
 views, but a JavaScript closure is going to contain references to all
 sorts of objects (like document, or local variables) that are very
 specific to the current page view.  It makes no sense to persist those
 objects in general.  You'd need to serialize the function somehow,
 possibly putting restrictions on the sorts of variables it can access,
 so that it can be sensibly restored later.  Is there some established
 way of doing this yet in JavaScript?  It might be useful in other
 contexts too.

 I still agree that this is the correct direction to go in, though.



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-03 Thread Aryeh Gregor
On Tue, May 3, 2011 at 10:56 AM, Keean Schupke ke...@fry-it.com wrote:
 Why does it need to be persisted? I would prefer the database to be
 stateless. Obviously all users of the database need to use the same
 function.

And if they don't use exactly the same function, maybe due to a
transient bug, the index is silently and permanently corrupted, until
all affected rows happen to be updated again?  That doesn't sound like
a good idea to me.



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-02 Thread Keean Schupke
On Sunday, 1 May 2011, Aryeh Gregor simetrical+...@gmail.com wrote:
 On Fri, Apr 29, 2011 at 3:32 PM, Jonas Sicking jo...@sicking.cc wrote:
 I agree that we will eventually want to standardize the set of allowed
 collations. Similarly to how we'll want to standardize on one set of
 charset encodings supported. However I don't think we, in this spec
 community, have enough experience to come up with a good such set. So
 it's something that I think we should postpone for now. As I
 understand it there is work going on in this area in other groups, so
 hopefully we can lean on that work eventually.

 (Disclaimer: I never really tried to figure out how IndexedDB works,
 and I haven't seen the past discussion on this topic.  However, I know
 a decent amount about database collations in practice from my work
 with MediaWiki, which included adding collation support to category
 pages last summer on a contract with Wikimedia.  Maybe everything I'm
 saying has already been brought up before and/or everyone knows it
 and/or it's wrong, in which case I apologize in advance.)

 The Unicode Collation Algorithm is the standard here:

 http://www.unicode.org/reports/tr10/

 It's pretty stable (I think), and out of the box it provides *vastly*
 better sorting than binary sort.  Binary sort doesn't even work for
 English unless you normalize case and avoid punctuation marks, and
 it's basically useless for most non-English languages.  Some type of
 UCA support in browsers would be the way to go here.

 UCA doesn't work perfectly for all locales, though, because different
 locales sort the same strings differently (French handling of accents,
 etc.).  The standard database of locale-specific collations is CLDR:

 http://cldr.unicode.org/

 CLDR tends to have several new releases per year.  For instance, 1.9.1
 was released this March, three versions were released last year, and
 five were released in 2009.  Just looking at the release notes, it
 seems that most if not all of these releases update collation details.
  Because of how collations are actually used in databases, any change
 to the collation version will require rebuilding any index that uses
 that collation.

 I don't think it's a good idea for browsers to try packaging such
 rapidly-changing locale data.  If everyone had Chrome's release and
 support schedule, it might work okay -- if you figured out a way to
 handle updates gracefully -- but in practice, authors deal with a wide
 range of browser ages.  It's not good if every user has a different
 implementation of each collation.  Nor if browsers just use a frozen
 and obsolescent collation version.  I also don't know how realistic
 implementers would find it to ship collation support for every
 language CLDR supports -- the CLDR download is a few megabytes zipped,
 but I don't know how much of that browsers would need to ship to
 support all its tailorings.

 The general solution here would be to allow the creation of indexes
 based on a user-supplied function.  I.e., the user-supplied function
 would (in SQL terms) take the row's data as input, and output some
 binary string.  That string would be used as the key in the index,
 instead of any of the column values for the row.  PostgreSQL allows
 this, or so I've heard.  Then you could implement UCA (optionally with
 CLDR tailorings) or any other collation algorithm you liked in
 JavaScript.

 Of course, we can't expect authors to reimplement the UCA if they want
 to get decent sorting.  It would make sense for browsers to expose
 some default sort functions, but I'm not familiar enough with UCA or
 CLDR to say which ones would be best in practice.  It might make sense
 to expose some medium-level primitives that would allow authors to
 easily overlay tailoring on the basic UCA algorithm, or something.  Or
 maybe it would really make sense to expose all of CLDR's tailored
 collations.  I'm not familiar enough with the specs to say.  But for
 the sake of flexibility, allowing indexes based on user-defined
 functions is the way to go.  (They're useful for things other than
 collations, too.)

 The proposed ECMAScript LocaleInfo.Collator looks like it doesn't
 currently support this use-case, since it provides only sort functions
 and not sortkey generation functions:

 http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api

 If browsers do provide sortkey generation functions based on UCA, some
 versioning mechanism will need to be used, particularly if it supports
 tailored sortkeys.


 FWIW, MySQL provides some built-in collation support, but MediaWiki
 doesn't use it, because it supports too few languages and is too
 inflexible.  MediaWiki's stock localization has 99% support for the
 500 most-used messages in 175 different languages, and the couple
 dozen locales that MySQL supports aren't acceptable for us.  Instead,
 we store everything with a binary collation, and are moving to a
 system where we compute the UCA sortkeys ourselves and put them in
 

Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-02 Thread Aryeh Gregor
On Fri, Apr 29, 2011 at 3:19 PM, Keean Schupke ke...@fry-it.com wrote:
 As long as we have a binary mode I am happy.

Something I didn't think to mention: what exactly is binary mode for
DOMStrings?  I guess it means you encode as big-endian UTF-16, then
sort bytewise?  This is kind of evil, but it matches what sort() does,
so I guess it should be the required behavior.  (It's kind of evil
because it doesn't match code-point order, unlike if you encoded as
UTF-8.  E.g., U+10000 is encoded as 0xd800dc00 and U+E000 is 0xe000,
so U+E000 sorts after U+10000.)

Perhaps this should be spelled out more clearly in the spec.
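
The mismatch between UTF-16 code-unit order and code-point order can be
checked directly in JavaScript. This is a sketch using U+10000 and U+E000;
compareCodePoints is a helper written for this example, not a built-in.

```javascript
const supplementary = String.fromCodePoint(0x10000); // "\uD800\uDC00"
const privateUse = "\uE000";

// Default sort() compares UTF-16 code units: 0xD800 < 0xE000.
const byCodeUnit = [privateUse, supplementary].sort();

// Comparing by code point instead puts U+E000 first.
function compareCodePoints(a, b) {
  const pa = Array.from(a, (ch) => ch.codePointAt(0));
  const pb = Array.from(b, (ch) => ch.codePointAt(0));
  const n = Math.min(pa.length, pb.length);
  for (let i = 0; i < n; i += 1) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return pa.length - pb.length;
}
const byCodePoint = [supplementary, privateUse].sort(compareCodePoints);

console.log(byCodeUnit[0] === supplementary); // true
console.log(byCodePoint[0] === privateUse); // true
```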



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-05-01 Thread Aryeh Gregor
On Fri, Apr 29, 2011 at 3:32 PM, Jonas Sicking jo...@sicking.cc wrote:
 I agree that we will eventually want to standardize the set of allowed
 collations. Similarly to how we'll want to standardize on one set of
 charset encodings supported. However I don't think we, in this spec
 community, have enough experience to come up with a good such set. So
 it's something that I think we should postpone for now. As I
 understand it there is work going on in this area in other groups, so
 hopefully we can lean on that work eventually.

(Disclaimer: I never really tried to figure out how IndexedDB works,
and I haven't seen the past discussion on this topic.  However, I know
a decent amount about database collations in practice from my work
with MediaWiki, which included adding collation support to category
pages last summer on a contract with Wikimedia.  Maybe everything I'm
saying has already been brought up before and/or everyone knows it
and/or it's wrong, in which case I apologize in advance.)

The Unicode Collation Algorithm is the standard here:

http://www.unicode.org/reports/tr10/

It's pretty stable (I think), and out of the box it provides *vastly*
better sorting than binary sort.  Binary sort doesn't even work for
English unless you normalize case and avoid punctuation marks, and
it's basically useless for most non-English languages.  Some type of
UCA support in browsers would be the way to go here.

UCA doesn't work perfectly for all locales, though, because different
locales sort the same strings differently (French handling of accents,
etc.).  The standard database of locale-specific collations is CLDR:

http://cldr.unicode.org/

CLDR tends to have several new releases per year.  For instance, 1.9.1
was released this March, three versions were released last year, and
five were released in 2009.  Just looking at the release notes, it
seems that most if not all of these releases update collation details.
 Because of how collations are actually used in databases, any change
to the collation version will require rebuilding any index that uses
that collation.
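
A sketch of what "rebuild on version change" could look like, using an
in-memory stand-in for an index. The version strings, the sortKey functions
and the openOrRebuild helper are all assumptions made for illustration.

```javascript
// Illustrative only: the version string and sortkey function are stand-ins
// for whatever collation data an implementation ships.
function buildIndex(values, collation) {
  return {
    collationVersion: collation.version,
    entries: values
      .map((value) => ({ key: collation.sortKey(value), value }))
      .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0)),
  };
}

// Rebuild only when the stored version differs from the current one.
function openOrRebuild(stored, values, collation) {
  if (stored && stored.collationVersion === collation.version) {
    return stored;
  }
  return buildIndex(values, collation);
}

const v1 = { version: "1", sortKey: (s) => s };
const v2 = { version: "2", sortKey: (s) => s.toLowerCase() };
const words = ["apple", "Banana", "cherry"];

const first = openOrRebuild(null, words, v1);
const reopened = openOrRebuild(first, words, v1); // same version: reused
const upgraded = openOrRebuild(first, words, v2); // new version: rebuilt

console.log(reopened === first);
console.log(first.entries.map((e) => e.value));
console.log(upgraded.entries.map((e) => e.value));
```

The stored order under version 1 (Banana, apple, cherry) is simply wrong
under version 2 (apple, Banana, cherry), which is why the index cannot be
reused across versions.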

I don't think it's a good idea for browsers to try packaging such
rapidly-changing locale data.  If everyone had Chrome's release and
support schedule, it might work okay -- if you figured out a way to
handle updates gracefully -- but in practice, authors deal with a wide
range of browser ages.  It's not good if every user has a different
implementation of each collation.  Nor if browsers just use a frozen
and obsolescent collation version.  I also don't know how realistic
implementers would find it to ship collation support for every
language CLDR supports -- the CLDR download is a few megabytes zipped,
but I don't know how much of that browsers would need to ship to
support all its tailorings.

The general solution here would be to allow the creation of indexes
based on a user-supplied function.  I.e., the user-supplied function
would (in SQL terms) take the row's data as input, and output some
binary string.  That string would be used as the key in the index,
instead of any of the column values for the row.  PostgreSQL allows
this, or so I've heard.  Then you could implement UCA (optionally with
CLDR tailorings) or any other collation algorithm you liked in
JavaScript.
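
To make the idea concrete, here is a toy in-memory index keyed by the output
of a user-supplied function. FunctionIndex and titleKey are made-up names for
this sketch; nothing here is part of IndexedDB.

```javascript
// In-memory stand-in for an index whose keys come from a user-supplied
// function. All names here are illustrative.
class FunctionIndex {
  constructor(keyOf) {
    this.keyOf = keyOf;
    this.entries = []; // kept sorted by key, compared as plain strings
  }

  put(record) {
    const key = this.keyOf(record); // runs once per write
    // Binary search for the insertion point using binary string order.
    let lo = 0;
    let hi = this.entries.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.entries[mid].key <= key) lo = mid + 1;
      else hi = mid;
    }
    this.entries.splice(lo, 0, { key, record });
  }

  // Reads never call keyOf; they just walk the stored order.
  records() {
    return this.entries.map((entry) => entry.record);
  }
}

// Example key function: case-insensitive title, ignoring a leading "The ".
const titleKey = (book) => book.title.toLowerCase().replace(/^the /, "");

const index = new FunctionIndex(titleKey);
index.put({ title: "The Hobbit" });
index.put({ title: "dune" });
index.put({ title: "Emma" });
console.log(index.records().map((book) => book.title));
```

The index itself only ever compares keys as plain strings; all
locale-specific or application-specific logic stays inside the key function.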

Of course, we can't expect authors to reimplement the UCA if they want
to get decent sorting.  It would make sense for browsers to expose
some default sort functions, but I'm not familiar enough with UCA or
CLDR to say which ones would be best in practice.  It might make sense
to expose some medium-level primitives that would allow authors to
easily overlay tailoring on the basic UCA algorithm, or something.  Or
maybe it would really make sense to expose all of CLDR's tailored
collations.  I'm not familiar enough with the specs to say.  But for
the sake of flexibility, allowing indexes based on user-defined
functions is the way to go.  (They're useful for things other than
collations, too.)

The proposed ECMAScript LocaleInfo.Collator looks like it doesn't
currently support this use-case, since it provides only sort functions
and not sortkey generation functions:

http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api
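
For comparison, the collator that ECMA-402 later standardized as
Intl.Collator has the same shape: it exposes a compare function and no
sortkey generator. The sketch below assumes an engine with Intl support and
English locale data.

```javascript
const words = ["Zebra", "apple", "Mango"];

// Binary (UTF-16 code unit) order: uppercase letters sort before lowercase.
const binaryOrder = [...words].sort();

// Intl.Collator exposes a comparison function, not a sortkey generator.
const collator = new Intl.Collator("en");
const collatedOrder = [...words].sort(collator.compare);

console.log(binaryOrder);
console.log(collatedOrder);
```

A comparison function like this works for sorting in memory, but it cannot
be turned into a stored index key without calling back into script on every
comparison.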

If browsers do provide sortkey generation functions based on UCA, some
versioning mechanism will need to be used, particularly if it supports
tailored sortkeys.


FWIW, MySQL provides some built-in collation support, but MediaWiki
doesn't use it, because it supports too few languages and is too
inflexible.  MediaWiki's stock localization has 99% support for the
500 most-used messages in 175 different languages, and the couple
dozen locales that MySQL supports aren't acceptable for us.  Instead,
we store everything with a binary collation, and are moving to a
system where we compute the UCA sortkeys ourselves and put them in
their own column, which we use for sorting.  MediaWiki's i18n people
can be reached in #mediawiki-i18n on freenode or the Mediawiki-i18n
list 

Re: [IndexedDB] Closing on bug 9903 (collations)

2011-04-29 Thread Jonas Sicking
On Fri, Apr 29, 2011 at 11:16 AM, Pablo Castro
pablo.cas...@microsoft.com wrote:
 We've had quite a bit of debate on this but I don't think we've reached 
 closure. At this point I would be fine with either one of a) postpone to v2 
 and agree that for now we'll just do binary collation everywhere or b) the 
 last form of the proposal sent around: extra collation argument (following 
 BCP47 plus whatever the UA wants to allow) in createObjectStore/createIndex, 
 plus a collation property to interrogate it; no way to change the collation 
 of a store/index once created.

 Given that this turned out to be a more elaborate topic than I had originally 
 expected and that it doesn't seem to have a lot of traction right now, my 
 preference would be to postpone to v2. Thoughts? Once we make a call I'll 
 make sure the spec reflects it.

I'd be fine with postponing it. However I don't think that the counter
proposals that we've received will work, so I don't think that there
is a reason to postpone.

/ Jonas



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-04-29 Thread Keean Schupke
On Friday, 29 April 2011, Jonas Sicking jo...@sicking.cc wrote:
 On Fri, Apr 29, 2011 at 11:16 AM, Pablo Castro
 pablo.cas...@microsoft.com wrote:
 We've had quite a bit of debate on this but I don't think we've reached 
 closure. At this point I would be fine with either one of a) postpone to v2 
 and agree that for now we'll just do binary collation everywhere or b) the 
 last form of the proposal sent around: extra collation argument (following 
 BCP47 plus whatever the UA wants to allow) in createObjectStore/createIndex, 
 plus a collation property to interrogate it; no way to change the collation 
 of a store/index once created.

 Given that this turned out to be a more elaborate topic than I had 
 originally expected and that it doesn't seem to have a lot of traction right 
 now, my preference would be to postpone to v2. Thoughts? Once we make a call 
 I'll make sure the spec reflects it.

 I'd be fine with postponing it. However I don't think that the counter
 proposals that we've received will work, so I don't think that there
 is a reason to postpone.

 / Jonas



As long as we have a binary mode I am happy. If the spec is to support other
collations, then all browsers must support the same set of options.
The question then becomes which set of collation modes to standardise
on. Allowing non-standard collations will result in apps that
only run correctly on one browser, and that does not seem a good idea
to me.
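
For concreteness, a small TypeScript sketch of the difference between the two modes. The binary comparison is fully determined by the code units of the strings, so it gives the same answer everywhere. The second ordering uses Intl.Collator, which delegates to whatever collation data the implementation ships; the "en" tag and the sample words are assumptions chosen for illustration.

```typescript
// Binary mode: compare strings by UTF-16 code units only.
const binaryCompare = (a: string, b: string): number =>
  a < b ? -1 : a > b ? 1 : 0;

const sampleWords = ["a", "Z"];

// "Z" (U+005A) sorts before "a" (U+0061) in binary order.
const binaryOrder = sampleWords.slice().sort(binaryCompare);

// Locale-aware mode: the result depends on the implementation's
// collation data for the requested BCP47 tag.
const englishCollator = new Intl.Collator("en");
const localeOrder = sampleWords
  .slice()
  .sort((a, b) => englishCollator.compare(a, b));
// With typical English collation data, "a" sorts before "Z" here.
```

An index built with the first comparison is portable by construction; one built with the second is only portable if every browser agrees on the collation behind the tag.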

Cheers,
Keean.



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-04-29 Thread Jonas Sicking
On Fri, Apr 29, 2011 at 12:19 PM, Keean Schupke ke...@fry-it.com wrote:
 On Friday, 29 April 2011, Jonas Sicking jo...@sicking.cc wrote:
 On Fri, Apr 29, 2011 at 11:16 AM, Pablo Castro
 pablo.cas...@microsoft.com wrote:
 We've had quite a bit of debate on this but I don't think we've reached 
 closure. At this point I would be fine with either one of a) postpone to v2 
 and agree that for now we'll just do binary collation everywhere or b) the 
 last form of the proposal sent around: extra collation argument 
 (following BCP47 plus whatever the UA wants to allow) in 
 createObjectStore/createIndex, plus a collation property to interrogate it; 
 no way to change the collation of a store/index once created.

 Given that this turned out to be a more elaborate topic than I had 
 originally expected and that it doesn't seem to have a lot of traction 
 right now, my preference would be to postpone to v2. Thoughts? Once we make 
 a call I'll make sure the spec reflects it.

 I'd be fine with postponing it. However I don't think that the counter
 proposals that we've received will work, so I don't think that there
 is a reason to postpone.

 / Jonas



 As long as we have a binary mode I am happy. If it is to support other
 collations, then all browsers must support the same set of options.
 The question then becomes what set of collation modes to standardise
 on? Allowing non standard collations will result in apps that will
 only run correctly on one browser, and that does not seem a good idea
 to me.

I agree that we will eventually want to standardize the set of allowed
collations, similar to how we'll want to standardize on one set of
supported charset encodings. However, I don't think we, in this spec
community, have enough experience to come up with a good such set, so
it's something that I think we should postpone for now. As I
understand it, there is work going on in this area in other groups, so
hopefully we can lean on that work eventually.

Of course, we still do need to have a standardized vocabulary for the
collations though.

/ Jonas



Re: [IndexedDB] Closing on bug 9903 (collations)

2011-04-29 Thread Keean Schupke
There is always something like UCA:

http://www.unicode.org/reports/tr10/

which looks interesting.

Cheers,
Keean.

