RE: [IndexedDB] Closing on bug 9903 (collations)
From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On Behalf Of Keean Schupke Sent: Tuesday, May 31, 2011 11:51 PM On 1 June 2011 01:37, Pablo Castro pablo.cas...@microsoft.com wrote: -Original Message- From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of Aryeh Gregor Sent: Tuesday, May 31, 2011 3:49 PM On Tue, May 31, 2011 at 6:39 PM, Pablo Castro pablo.cas...@microsoft.com wrote: No, that was poor wording on my part, I keep using locale in the wrong context. I meant to have the API take a proper collation identifier. The identifier can be as specific as the caller wants it to be. The implementation could choose to not honor some specific detail if it can't handle it (to the extent that doing so is allowed by the specification of collation names), or fail because it considers that not handling a particular aspect of the collation identifier would severely deviate from the caller's expectations. I'm not sure I understand you. My personal opinion is that there should be no undefined behavior here. If authors are allowed to pass collation identifiers, the spec needs to say exactly how they're to be interpreted, so the same identifier passed to two different browsers will result in the same collation, i.e., the same strings need to sort the same cross-browser. Having only binary collation is better than having non-binary collations but not defining them, IMO. I thought BCP47 allowed implementations to drop subtags if needed. I just re-read the spec and it seems that it only allows to do that in constrained cases where you can't fit the whole name in your buffer (which wouldn't apply to the context discussed here). My first instinct is that this is quite a bit to guarantee (full consistency in collation), but it seems that that's what the spec is shooting for. Given the amount of debate on this, could we at least agree that we can do binary for v1? 
We can then have an open item for v2 on taking collation names and sort according to UCA or taking callbacks and such.
I'm okay with supporting only binary to start with.
Great. I'll still wait a bit to see what other folks think, and then update the bug one way or the other. Thanks -pablo
The discussion sounds like it is headed in the right direction. Are there any issues with non-unicode encodings that need to be dealt with (HTTP headers default to ISO-8859 I think). Would people be expected to convert on read into UTF-16 strings or use typed-arrays?
I asked around here and folks actually pointed out that the JavaScript spec seems to be describing exactly what we needed. Looking at [1], section 11.8.5, the relevant fragment starting at step 4 goes:
Else, both px and py are Strings
a. If py is a prefix of px, return false. (A String value p is a prefix of String value q if q can be the result of concatenating p and some other String r. Note that any String is a prefix of itself, because r may be the empty String.)
b. If px is a prefix of py, return true.
c. Let k be the smallest nonnegative integer such that the character at position k within px is different from the character at position k within py. (There must be such a k, for neither String is a prefix of the other.)
d. Let m be the integer that is the code unit value for the character at position k within px.
e. Let n be the integer that is the code unit value for the character at position k within py.
f. If m < n, return true. Otherwise, return false.
It also has a note below indicating:
NOTE 2 The comparison of Strings uses a simple lexicographic ordering on sequences of code unit values. There is no attempt to use the more complex, semantically oriented definitions of character or string equality and collating order defined in the Unicode specification. Therefore String values that are canonically equal according to the Unicode standard could test as unequal.
In effect this algorithm assumes that both Strings are already in normalised form. Also, note that for strings containing supplementary characters, lexicographic ordering on sequences of UTF-16 code unit values differs from that on sequences of code point values. Which is very much in line with what we've been discussing, and has the extra feature of being compatible with JavaScript order. So it looks like we could reference (or inline) this in the spec and have a fully specified order for keys with string content. Thoughts? Thanks -pablo [1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf
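The quoted steps a-f and NOTE 2 can be sketched in a few lines of JavaScript. This is only an illustration, not spec text: the helper name lessThanByCodeUnits and the sample strings are assumptions chosen for the example. It shows that code unit order differs from code point order for supplementary characters, and that canonically equal strings compare as unequal unless they are normalised first.

```javascript
// Returns true when x sorts strictly before y, comparing UTF-16 code units
// in the manner of the quoted steps a-f. (Hypothetical helper name.)
function lessThanByCodeUnits(x, y) {
  if (x.startsWith(y)) return false; // step a: y is a prefix of x
  if (y.startsWith(x)) return true;  // step b: x is a prefix of y
  let k = 0;
  // step c: neither string is a prefix of the other, so a differing
  // position exists before either string ends.
  while (x.charCodeAt(k) === y.charCodeAt(k)) k++;
  return x.charCodeAt(k) < y.charCodeAt(k); // steps d-f
}

const fullwidthTilde = "\uFF5E";  // U+FF5E, a single code unit
const emoji = "\uD83D\uDE00";     // U+1F600, stored as a surrogate pair

// Code unit order: 0xD83D < 0xFF5E, so the emoji sorts first ...
const emojiFirstByCodeUnit = lessThanByCodeUnits(emoji, fullwidthTilde);
// ... although its code point (0x1F600) is the larger of the two.
const emojiFirstByCodePoint = emoji.codePointAt(0) < fullwidthTilde.codePointAt(0);

const composed = "\u00E9";    // "é" as one code point
const decomposed = "e\u0301"; // "e" followed by a combining acute accent
const equalAsStored = composed === decomposed;
const equalAfterNfc = composed.normalize("NFC") === decomposed.normalize("NFC");
```

The built-in < operator on two strings follows the same ordering, so the helper is only there to make the individual steps visible.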
Re: [IndexedDB] Closing on bug 9903 (collations)
On Fri, Jun 17, 2011 at 11:43 AM, Pablo Castro pablo.cas...@microsoft.com wrote: From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On Behalf Of Keean Schupke Sent: Tuesday, May 31, 2011 11:51 PM On 1 June 2011 01:37, Pablo Castro pablo.cas...@microsoft.com wrote: -Original Message- From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of Aryeh Gregor Sent: Tuesday, May 31, 2011 3:49 PM On Tue, May 31, 2011 at 6:39 PM, Pablo Castro pablo.cas...@microsoft.com wrote: No, that was poor wording on my part, I keep using locale in the wrong context. I meant to have the API take a proper collation identifier. The identifier can be as specific as the caller wants it to be. The implementation could choose to not honor some specific detail if it can't handle it (to the extent that doing so is allowed by the specification of collation names), or fail because it considers that not handling a particular aspect of the collation identifier would severely deviate from the caller's expectations. I'm not sure I understand you. My personal opinion is that there should be no undefined behavior here. If authors are allowed to pass collation identifiers, the spec needs to say exactly how they're to be interpreted, so the same identifier passed to two different browsers will result in the same collation, i.e., the same strings need to sort the same cross-browser. Having only binary collation is better than having non-binary collations but not defining them, IMO. I thought BCP47 allowed implementations to drop subtags if needed. I just re-read the spec and it seems that it only allows to do that in constrained cases where you can't fit the whole name in your buffer (which wouldn't apply to the context discussed here). My first instinct is that this is quite a bit to guarantee (full consistency in collation), but it seems that that's what the spec is shooting for. Given the amount of debate on this, could we at least agree that we can do binary for v1? 
We can then have an open item for v2 on taking collation names and sort according to UCA or taking callbacks and such.
I'm okay with supporting only binary to start with.
Great. I'll still wait a bit to see what other folks think, and then update the bug one way or the other. Thanks -pablo
The discussion sounds like it is headed in the right direction. Are there any issues with non-unicode encodings that need to be dealt with (HTTP headers default to ISO-8859 I think). Would people be expected to convert on read into UTF-16 strings or use typed-arrays?
I asked around here and folks actually pointed out that the JavaScript spec seems to be describing exactly what we needed. Looking at [1], section 11.8.5, the relevant fragment starting at step 4 goes:
Else, both px and py are Strings
a. If py is a prefix of px, return false. (A String value p is a prefix of String value q if q can be the result of concatenating p and some other String r. Note that any String is a prefix of itself, because r may be the empty String.)
b. If px is a prefix of py, return true.
c. Let k be the smallest nonnegative integer such that the character at position k within px is different from the character at position k within py. (There must be such a k, for neither String is a prefix of the other.)
d. Let m be the integer that is the code unit value for the character at position k within px.
e. Let n be the integer that is the code unit value for the character at position k within py.
f. If m < n, return true. Otherwise, return false.
It also has a note below indicating:
NOTE 2 The comparison of Strings uses a simple lexicographic ordering on sequences of code unit values. There is no attempt to use the more complex, semantically oriented definitions of character or string equality and collating order defined in the Unicode specification. Therefore String values that are canonically equal according to the Unicode standard could test as unequal.
In effect this algorithm assumes that both Strings are already in normalised form. Also, note that for strings containing supplementary characters, lexicographic ordering on sequences of UTF-16 code unit values differs from that on sequences of code point values. Which is very much in line with what we've been discussing, and has the extra feature of being compatible with JavaScript order. So it looks like we could reference (or inline) this in the spec and have a fully specified order for keys with string content. Thoughts? Sounds great! Thanks for doing the research here! / Jonas
Re: [IndexedDB] Closing on bug 9903 (collations)
On 1 June 2011 01:37, Pablo Castro pablo.cas...@microsoft.com wrote: -Original Message- From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of Aryeh Gregor Sent: Tuesday, May 31, 2011 3:49 PM On Tue, May 31, 2011 at 6:39 PM, Pablo Castro pablo.cas...@microsoft.com wrote: No, that was poor wording on my part, I keep using locale in the wrong context. I meant to have the API take a proper collation identifier. The identifier can be as specific as the caller wants it to be. The implementation could choose to not honor some specific detail if it can't handle it (to the extent that doing so is allowed by the specification of collation names), or fail because it considers that not handling a particular aspect of the collation identifier would severely deviate from the caller's expectations. I'm not sure I understand you. My personal opinion is that there should be no undefined behavior here. If authors are allowed to pass collation identifiers, the spec needs to say exactly how they're to be interpreted, so the same identifier passed to two different browsers will result in the same collation, i.e., the same strings need to sort the same cross-browser. Having only binary collation is better than having non-binary collations but not defining them, IMO. I thought BCP47 allowed implementations to drop subtags if needed. I just re-read the spec and it seems that it only allows to do that in constrained cases where you can't fit the whole name in your buffer (which wouldn't apply to the context discussed here). My first instinct is that this is quite a bit to guarantee (full consistency in collation), but it seems that that's what the spec is shooting for. Given the amount of debate on this, could we at least agree that we can do binary for v1? We can then have an open item for v2 on taking collation names and sort according to UCA or taking callbacks and such. I'm okay with supporting only binary to start with. Great. 
I'll still wait a bit to see what other folks think, and then update the bug one way or the other. Thanks -pablo The discussion sounds like it is headed in the right direction. Are there any issues with non-unicode encodings that need to be dealt with (HTTP headers default to ISO-8859 I think). Would people be expected to convert on read into UTF-16 strings or use typed-arrays? Cheers, Keean.
RE: [IndexedDB] Closing on bug 9903 (collations)
-Original Message- From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of Aryeh Gregor Sent: Friday, May 06, 2011 10:05 AM On Fri, May 6, 2011 at 5:18 AM, Jonas Sicking jo...@sicking.cc wrote: Based on that, my conclusion is that we should go with what Pablo is proposing. And I think we should do it for v1. If I understand correctly, Pablo's proposal is that the author be allowed to specify a locale, and the browser can collate in some undefined way based on that locale. That sounds like a really bad idea for interop. If non-binary collation is supported in a first version, it should be either No, that was poor wording on my part, I keep using locale in the wrong context. I meant to have the API take a proper collation identifier. The identifier can be as specific as the caller wants it to be. The implementation could choose to not honor some specific detail if it can't handle it (to the extent that doing so is allowed by the specification of collation names), or fail because it considers that not handling a particular aspect of the collation identifier would severely deviate from the caller's expectations. 1) Two choices, binary or UCA 6.0.0. (AFAIK, UCA gives fairly good results for most languages even without tailoring, so it might be just fine for v1. It's vastly better than binary, for sure.) Given the amount of debate on this, could we at least agree that we can do binary for v1? We can then have an open item for v2 on taking collation names and sort according to UCA or taking callbacks and such. 2) In addition to binary and UCA 6.0.0, allow UCA 6.0.0 tailored by any of the locales defined by CLDR 1.9.1. There also needs to be some thought put into how to handle version updates, since browsers cannot update their UCA or CLDR implementation without rebuilding all existing indexes that used it (unless they keep the old implementation forever). 
It might be that browsers should just stick to a fixed version for the time being (like 6.0.0 and 1.9.1), and we might decide that no further APIs are needed now to accommodate possible future switches, but at least some thought needs to be given to it. I wonder if the API (independently of when we get to this) should include the version either as part of the collation identifier or as a separate argument. This would allow UAs to support a version or two for a while, and then phase them out as they fall out of use in favor of newer ones. On consideration, I don't think user-specified sortkey functions are necessary at this stage. If collations are to be identified by strings for now, we could always overload the value to accept a function at some later date if we wanted to support that. So I wouldn't worry about that further. I agree. -pablo
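The idea of carrying the version alongside the collation name can be sketched as follows. Everything here is hypothetical: the descriptor shape, the SUPPORTED_COLLATIONS table and the resolveCollation helper are invented for illustration and are not part of any IndexedDB draft. The point is only that an index could record exactly which version it was built with, and that a user agent could refuse versions it has phased out instead of silently substituting another.

```javascript
// Hypothetical table of collations a user agent still ships, keyed by name.
// "binary" carries no version because code unit order never changes.
const SUPPORTED_COLLATIONS = {
  binary: [null],
  uca: ["6.0.0"],
};

// Hypothetical helper: returns a frozen descriptor to store with the index,
// or throws if the requested name/version pair is not available.
function resolveCollation(name, version = null) {
  const versions = SUPPORTED_COLLATIONS[name];
  if (!versions || !versions.includes(version)) {
    throw new RangeError(
      "Unsupported collation: " + name + (version ? " " + version : "")
    );
  }
  return Object.freeze({ name, version });
}

const forNewIndex = resolveCollation("uca", "6.0.0");

let phasedOutError = null;
try {
  resolveCollation("uca", "5.2.0"); // a version this sketch does not ship
} catch (e) {
  phasedOutError = e;
}
```

Failing loudly on an unknown version matches the position taken in the thread that there should be no undefined behaviour for a given identifier.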
Re: [IndexedDB] Closing on bug 9903 (collations)
On Tue, May 31, 2011 at 6:39 PM, Pablo Castro pablo.cas...@microsoft.com wrote: No, that was poor wording on my part, I keep using locale in the wrong context. I meant to have the API take a proper collation identifier. The identifier can be as specific as the caller wants it to be. The implementation could choose to not honor some specific detail if it can't handle it (to the extent that doing so is allowed by the specification of collation names), or fail because it considers that not handling a particular aspect of the collation identifier would severely deviate from the caller's expectations. I'm not sure I understand you. My personal opinion is that there should be no undefined behavior here. If authors are allowed to pass collation identifiers, the spec needs to say exactly how they're to be interpreted, so the same identifier passed to two different browsers will result in the same collation, i.e., the same strings need to sort the same cross-browser. Having only binary collation is better than having non-binary collations but not defining them, IMO. Given the amount of debate on this, could we at least agree that we can do binary for v1? We can then have an open item for v2 on taking collation names and sort according to UCA or taking callbacks and such. I'm okay with supporting only binary to start with.
RE: [IndexedDB] Closing on bug 9903 (collations)
-Original Message- From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of Aryeh Gregor Sent: Tuesday, May 31, 2011 3:49 PM On Tue, May 31, 2011 at 6:39 PM, Pablo Castro pablo.cas...@microsoft.com wrote: No, that was poor wording on my part, I keep using locale in the wrong context. I meant to have the API take a proper collation identifier. The identifier can be as specific as the caller wants it to be. The implementation could choose to not honor some specific detail if it can't handle it (to the extent that doing so is allowed by the specification of collation names), or fail because it considers that not handling a particular aspect of the collation identifier would severely deviate from the caller's expectations. I'm not sure I understand you. My personal opinion is that there should be no undefined behavior here. If authors are allowed to pass collation identifiers, the spec needs to say exactly how they're to be interpreted, so the same identifier passed to two different browsers will result in the same collation, i.e., the same strings need to sort the same cross-browser. Having only binary collation is better than having non-binary collations but not defining them, IMO. I thought BCP47 allowed implementations to drop subtags if needed. I just re-read the spec and it seems that it only allows to do that in constrained cases where you can't fit the whole name in your buffer (which wouldn't apply to the context discussed here). My first instinct is that this is quite a bit to guarantee (full consistency in collation), but it seems that that's what the spec is shooting for. Given the amount of debate on this, could we at least agree that we can do binary for v1? We can then have an open item for v2 on taking collation names and sort according to UCA or taking callbacks and such. I'm okay with supporting only binary to start with. Great. I'll still wait a bit to see what other folks think, and then update the bug one way or the other. Thanks -pablo
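To make the binary versus language-aware distinction concrete, here is a small sketch using standard JavaScript only. It assumes an environment that provides Intl.Collator with English data; Intl.Collator is not part of the proposal discussed here and is used purely to show how the two orders can differ. Its results depend on the collation data an implementation ships, which is the interoperability concern raised above.

```javascript
const words = ["a", "B", "c"];

// Array.prototype.sort with no comparator orders strings by UTF-16 code
// units, which is the "binary" order: uppercase "B" (0x42) precedes "a" (0x61).
const binaryOrder = [...words].sort();

// A language-aware collator places "a" before "B".
const collator = new Intl.Collator("en");
const collatedOrder = [...words].sort(collator.compare);
```

Binary order is fully determined by the string contents, so every implementation produces the same result without shipping or versioning any collation tables.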
Re: [IndexedDB] Closing on bug 9903 (collations)
On 5/6/2011 7:07 AM, timeless wrote: I think that a stored procedure could be considered as a compiled version of a serialized function. i.e. something which loses its scope chain, and which loses access to its parent object. If it loses access to its scope chain which includes the interesting globals, it will no longer have access to fun things like DOM objects, roughly like DOMWorkers but with even less exciting objects available. I'd hope that a jit should be able to do a fairly reasonable job of optimizing such a function given these constraints. This may be what we go with, but not in version 1. Cheers, Shawn
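What "loses its scope chain" means in practice can be sketched with plain JavaScript. This is an illustration under assumptions, not a proposed storage mechanism: makeComparator and the variable names are invented, and round-tripping through toString and the Function constructor merely stands in for whatever serialization a database might use.

```javascript
// The comparator closes over `caseInsensitive` from its enclosing scope.
function makeComparator() {
  const caseInsensitive = true;
  return function compare(a, b) {
    const x = caseInsensitive ? a.toLowerCase() : a;
    const y = caseInsensitive ? b.toLowerCase() : b;
    return x < y ? -1 : x > y ? 1 : 0;
  };
}

const live = makeComparator();
const liveResult = live("A", "b"); // works: the closure is intact

// Serialize to source text and rebuild. The rebuilt function is created in
// the global scope, so the captured variable is gone.
const source = live.toString();
const revived = new Function("return (" + source + ");")();

let revivedError = null;
try {
  revived("A", "b");
} catch (e) {
  revivedError = e; // ReferenceError: caseInsensitive is not defined
}
```

A stored comparison function would therefore have to be self-contained, depending only on its arguments and on whatever restricted globals the database chooses to expose.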
Re: [IndexedDB] Closing on bug 9903 (collations)
On 6 May 2011 03:00, Jonas Sicking jo...@sicking.cc wrote: On Wed, May 4, 2011 at 11:12 PM, Keean Schupke ke...@fry-it.com wrote: On 5 May 2011 00:33, Aryeh Gregor simetrical+...@gmail.com wrote: On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking jo...@sicking.cc wrote: I don't think we should do callbacks for the first version of javascript. It gets very messy since we can't rely on the script function returning stable values. The worst that would happen if it didn't return stable values is that sorting would return unpredictable results. Worst is an infinite loop - no return. So the choice here really is between only supporting some form of binary sorting, or supporting a built-in set of collations. Anything else will have to wait for version 2 in my opinion. I think it would be a mistake to try supporting a limited set of natural-language collations. Binary collation is fine for a first version. MySQL only supported binary collation up through version 4, for instance. A good point about MySQL. On Wed, May 4, 2011 at 3:49 AM, Keean Schupke ke...@fry-it.com wrote: I thought only the app that created the db could open it (for security reasons)... so it becomes the app's responsibility to do version control. The comparison function is not going to change by itself - someone has to go into the code and change it, when they do that they should up the revision of the database, if that change is incompatible. Why should we let such a pitfall exist if we can just store the function and avoid the issue? I don't see it as a pitfall; it has the advantage of transparency. There is exactly the same problem with object properties. If the app changes to expect a new property on all objects stored, then the app has to correctly deal with the update. If a requested property doesn't exist, I assume the API will fail immediately with a clear error code. It will not fail silently and mysteriously with no error code.
(Again, I haven't looked at it closely, or tried to use it.) What if the new version uses the same property name for a different thing? For example in V1 'Employer' is a string name, and in V2 'Employer' is a reference to another object. You may say 'you should change the column name'? Right, that's just the same as me saying you should change the DB version number when you change the collation algorithm. It's the same thing. People seem to be making a big fuss about having a non-persisted collation function defined in user code, when many, many things require the code to have the correct model of the data stored in the database to work properly. It seems illogical to make a special case for this function, and not do anything about all the other cases. IMHO either the database should have a stored schema, or it should not. If IndexedDB is going the direction of not having a stored schema, then the designers should have the confidence in their decision to stick with it and at least produce something with a consistent approach to the problem. 2) making things easy for the user - for me a simpler, more predictable API is better for the user. Having a function stored inside the database is bad, because you cannot see what function might be stored in there... We could let you query the stored function. Why would you need to read it? Every time you open the database you would need to check the function is the one you expect. The code would have to contain the function so it can compare it with the one in the DB and update it if necessary. If the code contains the function there are two copies of the function, one in the database and one in the code. Which one is correct? Which one is it using? So sometimes you will write the new function to the database, and sometimes you will not? More paths to test in code coverage, more complexity. It's simpler to just always set the function when opening the database.
it might be a function from a previous version of the code and cause all sorts of strange bugs (which will only affect certain users with a certain version of the function stored in their DB). It will cause *much* less strange bugs than if you have one index that used two different collations, which is the alternative possibility. If the function is stored, the worst case will be that the collation function is out of date. In practice, authors will mostly want to use established collation functions like UCA and won't mind if they're out of date. They'll also only very rarely have occasion to deliberately change the function. As I said, you will end up querying the function to see if it is the one you want to use, if you do that you may as well set it every time. Thinking about this a bit more. If you change the collation function you need to re-sort the
Re: [IndexedDB] Closing on bug 9903 (collations)
On 6 May 2011 00:22, Aryeh Gregor simetrical+...@gmail.com wrote: On Thu, May 5, 2011 at 2:12 AM, Keean Schupke ke...@fry-it.com wrote: What if the new version uses the same property name for a different thing? Yes, obviously it's going to be possible for code changes to cause hard-to-catch bugs due to not updating the database correctly. We don't have to add more cases where that's possible than necessary, without good reason. Maybe there's good reason here, but the added potential for error can't be neglected as a cost. I have seen many bugs in real databases due to stored procedures. Why would you need to read it? Every time you open the database you would need to check the function is the one you expect. Not if you never intend to change it, or don't care if it's outdated. I expect this to be the most common case. People don't change the language setting in an application? Consider the case of someone using CLDR-tailored UCA and a new version comes out. You want to use the newest version for new indexes, if multiple versions are available, but there's no pressing need to automatically update existing indexes. The old version is almost certainly good enough, unless your users use obscure languages. So in my scheme, you can just update the function in your code and do nothing else. In your scheme, you'd have to either stick to the old version across the board, or include both versions in your code indefinitely and include out-of-band logic to choose between them, or write a script that rebuilds the whole index on update (which would take a long time for a large index). At least then the logic to choose between collations is visible in the code, rather than hidden. This is all about transparency and making sure the programmer has control of what is happening, rather than locking them into limiting patterns, and giving them the ability to see exactly what the code will do by reading and code-reviewing it.
With a stored procedure, what happens when a function you call (that is not stored) changes? The only way to be sure is to run a validation check in the index (run from beginning to end checking the order is consistent with the comparison function). That is the same whether you use stored procedures or not. The code would have to contain the function so it can compare it with the one in the DB and update it if necessary. If the code contains the function there are two copies of the function, one in the database and one in the code. Which one is correct? Which one is it using? So sometimes you will write the new function to the database, and sometimes you will not? More paths to test in code coverage, more complexity. It's simpler to just always set the function when opening the database. If the collation function is stored in the database, then I'd expect setting the function to rebuild the index if the new and old functions differ. This could happen as a background operation, with the existing index still usable (with the old collation function) in the meantime. So if you always wanted collations up-to-date, in my scheme authors could just set the function every time they open the database, as with your scheme. But this could trigger a silent rebuild whenever necessary, so the author doesn't have to worry about it. In your scheme, the author has to do the rebuild himself, and if he gets it wrong, the index will be corrupted. So as I see it, my approach is easier to use across the board. It lets you not update collations on old tables without requiring you to keep track of multiple collation function versions, and it also potentially lets you update collations on old tables to the latest versions with rebuilding done for you in the background. Critically, it does not let you change a sort function without rebuilding, since that will always cause bugs and you never want to do it (to a first approximation).
Of course, maybe an initial implementation wouldn't do rebuilds for you, to keep it simple. Then the collation function would be immutable after index creation, so you'd still have to do rebuilds yourself. But it would still be easier and safer: the old index will still work in the interim even if you don't have the old version of your collation function around, and you can't mess up and get a corrupted index. Thinking about this a bit more. If you change the collation function you need to re-sort the index to make sure it will work (and avoid those strange bugs). Storing the function in the DB enables you to compare the function and only change it when you need to, thus optimising the number of re-sorts. That is the _only_ advantage to storing the function - as you still need to check the function stored is the one you expect to guarantee your code will run properly. So with a non-persisted function we need to sort every time we open to make sure the order is correct. And
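The validation check mentioned in this message (walk the index from beginning to end and confirm the order agrees with the comparison function) is a single linear pass. The sketch below assumes the index keys are available as an in-memory array, which is a simplification; the names isOrderedBy, binaryCompare and ignoreCaseCompare are invented for the example.

```javascript
// Returns true when every adjacent pair of keys is in non-descending order
// according to `compare` (negative, zero or positive, like Array.prototype.sort).
function isOrderedBy(keys, compare) {
  for (let i = 1; i < keys.length; i++) {
    if (compare(keys[i - 1], keys[i]) > 0) return false;
  }
  return true;
}

const binaryCompare = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
const ignoreCaseCompare = (a, b) =>
  binaryCompare(a.toLowerCase(), b.toLowerCase());

// An index that was built under binary (code unit) order.
const indexKeys = ["B", "a", "c"];

const validForBinary = isOrderedBy(indexKeys, binaryCompare);
const validForIgnoreCase = isOrderedBy(indexKeys, ignoreCaseCompare);
```

The pass costs one comparison per stored key, so running it on every open grows linearly with the index; that cost is what comparing a stored function or a version number would avoid.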
Re: [IndexedDB] Closing on bug 9903 (collations)
On Thu, May 5, 2011 at 11:36 PM, Keean Schupke ke...@fry-it.com wrote: On 6 May 2011 03:00, Jonas Sicking jo...@sicking.cc wrote: On Wed, May 4, 2011 at 11:12 PM, Keean Schupke ke...@fry-it.com wrote: On 5 May 2011 00:33, Aryeh Gregor simetrical+...@gmail.com wrote: On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking jo...@sicking.cc wrote: I don't think we should do callbacks for the first version of javascript. It gets very messy since we can't rely on the script function returning stable values. The worst that would happen if it didn't return stable values is that sorting would return unpredictable results. Worst is an infinite loop - no return. So the choice here really is between only supporting some form of binary sorting, or supporting a built-in set of collations. Anything else will have to wait for version 2 in my opinion. I think it would be a mistake to try supporting a limited set of natural-language collations. Binary collation is fine for a first version. MySQL only supported binary collation up through version 4, for instance. A good point about MySQL. On Wed, May 4, 2011 at 3:49 AM, Keean Schupke ke...@fry-it.com wrote: I thought only the app that created the db could open it (for security reasons)... so it becomes the app's responsibility to do version control. The comparison function is not going to change by itself - someone has to go into the code and change it, when they do that they should up the revision of the database, if that change is incompatible. Why should we let such a pitfall exist if we can just store the function and avoid the issue? I don't see it as a pitfall; it has the advantage of transparency. There is exactly the same problem with object properties. If the app changes to expect a new property on all objects stored, then the app has to correctly deal with the update. If a requested property doesn't exist, I assume the API will fail immediately with a clear error code.
It will not fail silently and mysteriously with no error code. (Again, I haven't looked at it closely, or tried to use it.) What if the new version uses the same property name for a different thing? For example in V1 'Employer' is a string name, and in V2 'Employer' is a reference to another object. You may say 'you should change the column name'? Right, that's just the same as me saying you should change the DB version number when you change the collation algorithm. It's the same thing. People seem to be making a big fuss about having a non-persisted collation function defined in user code, when many, many things require the code to have the correct model of the data stored in the database to work properly. It seems illogical to make a special case for this function, and not do anything about all the other cases. IMHO either the database should have a stored schema, or it should not. If IndexedDB is going the direction of not having a stored schema, then the designers should have the confidence in their decision to stick with it and at least produce something with a consistent approach to the problem. 2) making things easy for the user - for me a simpler, more predictable API is better for the user. Having a function stored inside the database is bad, because you cannot see what function might be stored in there... We could let you query the stored function. Why would you need to read it? Every time you open the database you would need to check the function is the one you expect. The code would have to contain the function so it can compare it with the one in the DB and update it if necessary. If the code contains the function there are two copies of the function, one in the database and one in the code. Which one is correct? Which one is it using? So sometimes you will write the new function to the database, and sometimes you will not? More paths to test in code coverage, more complexity. It's simpler to just always set the function when opening the database.
it might be a function from a previous version of the code and cause all sorts of strange bugs (which will only affect certain users with a certain version of the function stored in their DB). It will cause *much* less strange bugs than if you have one index that used two different collations, which is the alternative possibility. If the function is stored, the worst case will be that the collation function is out of date. In practice, authors will mostly want to use established collation functions like UCA and won't mind if they're out of date. They'll also only very rarely have occasion to deliberately change the function. As I said, you will end up querying the function to see if it is the one you want to use, if you do that you may as well set it every time.
Re: [IndexedDB] Closing on bug 9903 (collations)
On Fri, May 6, 2011 at 2:32 AM, Jonas Sicking jo...@sicking.cc wrote:

I'm not worried about crashes or security issues, but I am worried about performance. Not only is it the overhead of crossing from C++ into JS, but also the fact that the C++ code has to go through extra pains to ensure that the world around it still makes sense by the time you come back from the JS callback. For example the callback could have deleted all IndexedDB databases and navigated to a new page. So every time you get back from JS you have to spend a bunch of time rechecking all the state that you were holding in your function implementation.

I think that a stored procedure could be considered as a compiled version of a serialized function, i.e. something which loses its scope chain, and which loses access to its parent object. If it loses access to its scope chain which includes the interesting globals, it will no longer have access to fun things like DOM objects, roughly like DOMWorkers but with even less exciting objects available. I'd hope that a JIT should be able to do a fairly reasonable job of optimizing such a function given these constraints. The resulting keys could be stored with the database, so you don't have to recalculate them while sorting, only during insertion or if the sort key function is changed.

All of this is totally doable. It's not even particularly hard. But it costs performance.
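The stored-sort-key idea in the reply above can be sketched roughly as follows. This is only an illustration under assumptions: `SortKeyIndex`, `caseInsensitiveKey` and `binaryCompare` are hypothetical names, not part of IndexedDB, and the in-memory array stands in for whatever the browser would actually persist. The point it shows is that the script function runs once per insertion, and every later comparison is a plain binary comparison of the stored keys.

```javascript
// Hypothetical pure key function: it depends only on its argument,
// so it needs no scope chain and no access to DOM objects.
function caseInsensitiveKey(value) {
  return value.toLowerCase();
}

// Binary comparison on code units, which is what `<` does for two strings.
function binaryCompare(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Hypothetical in-memory index that stores each sort key next to its value.
class SortKeyIndex {
  constructor(keyFunction) {
    this.keyFunction = keyFunction;
    this.entries = []; // kept ordered by stored key
  }

  insert(value) {
    // The only place the script-defined key function is called.
    const key = this.keyFunction(value);
    // Binary search over the stored keys for the insertion point.
    let lo = 0;
    let hi = this.entries.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (binaryCompare(this.entries[mid].key, key) <= 0) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    this.entries.splice(lo, 0, { key, value });
  }

  values() {
    return this.entries.map((entry) => entry.value);
  }
}

const nameIndex = new SortKeyIndex(caseInsensitiveKey);
["cherry", "Banana", "apple"].forEach((name) => nameIndex.insert(name));
```

With a plain binary ordering "Banana" would sort before "apple"; with the stored lowercase keys the index iterates as apple, Banana, cherry. Changing the key function would still require recomputing every stored key, which is the cost discussed in the thread.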
Re: [IndexedDB] Closing on bug 9903 (collations)
On Thu, May 5, 2011 at 10:00 PM, Jonas Sicking jo...@sicking.cc wrote: We have already decided that we don't want to take on the complexity that comes with supporting changing collations on existing data. In particular it becomes very unclear what to do with data that is no longer unique under the new collation. This is only an issue for unique indexes. In MySQL, if you alter a table such that a uniqueness constraint is violated, it will abort with an error as soon as it detects the problem, not changing the table. But if you're using a non-binary collation function, you rarely want a unique index anyway. Still, I don't think this is needed for a first implementation of collations. It's something to support at some future date. I think ultimately we simply seem to disagree here. I think that supporting a standard set of collations is going to solve more than 80% of the use cases (which is a good rule of thumb for these things) for version 1 as well as is easier on users and so something we'll ultimately will want to add anyway. Thus adding it now won't be painting us in a corner and it solves the majority of use cases. If I understand you correctly you don't think that it solves the majority of use cases and you think that it adds API which is bad and that we should never add. Is this a correct assessment? For my part, I agree that supporting a high-quality, comprehensive, standard set of collations, such as UCA with CLDR tailoring, is going to solve much more than 80% of the use-cases. However, 1) Versioning is a possible issue if we want full interop, since CLDR changes often. If browsers can't update the collation of existing indexes, they'll be forced to either stick to one version of CLDR forever, or carry around multiple CLDR version implementations to account for both old and new indexes. Moreover, if browsers do ever update their CLDR version, we'll have different collations going by the same name in different browsers. 
One way to work around this is to specify for a first pass that browsers must implement some specific CLDR version, like the latest at the time the standard is published, and then just not update it for some indefinite period. 2) If there's going to be collation support in any version, it should be full-fledged UCA, not anything less. Better to push off collation support entirely to a future version than to have some simplified or undefined collation support that will have to be maintained forever. So if possible, support for all CLDR locales would be great; failing that, support for just untailored UCA; failing that, binary collation only. Much better to allow binary collation only than to not define the collation behavior. 3) Allowing users to specify a collation function is not needed in a first or second draft, but could be a useful feature for the future, so it would be worthwhile to at least keep that in mind when defining the API. As long as the API could be later extended to support custom functions without too much trouble, that should be enough for now IMO. I'm sure there are more important things to worry about. (Custom collation functions can be useful for things other than natural language. For instance, http://en.wikipedia.org/wiki/Special:LinkSearch lets you search external links on Wikipedia by prefix. It supports searching for things like *wikipedia.org, which will actually match a domain of ^.*wikipedia.org$ with any path. This works by having an extra field in the externallinks table containing the URL with domain names reversed, like http://org.wikipedia.en./wiki/ instead of http://en.wikipedia.org/wiki/, and this extra field is then indexed. This is a waste of space, since we store the URLs twice. In PostgreSQL we could instead define an index based on a function without having to create an extra column. But as this example illustrates, it's not essential functionality -- you can always add a redundant column.) 
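The reversed-domain trick described in the parenthetical above can be sketched like this. It is not MediaWiki's actual code; `reversedHostKey` is a hypothetical helper, and the regular expression only handles simple `scheme://host/path` URLs. It shows why a binary-sorted index on the reversed form turns a "*wikipedia.org" search into an ordinary prefix search.

```javascript
// Hypothetical helper: reverse the host labels of a URL so that
// "http://en.wikipedia.org/wiki/" becomes "http://org.wikipedia.en./wiki/".
function reversedHostKey(url) {
  const match = /^([a-z][a-z0-9+.-]*:\/\/)([^\/]*)(.*)$/i.exec(url);
  if (match === null) {
    return url; // not a scheme://host/path URL; leave it unchanged
  }
  const [, scheme, host, rest] = match;
  const reversed = host.split(".").reverse().join(".");
  return scheme + reversed + "." + rest;
}

const urls = [
  "http://en.wikipedia.org/wiki/",
  "http://example.com/",
  "http://de.wikipedia.org/wiki/Berlin",
];

// The default sort compares strings by code unit, i.e. a binary collation.
const keys = urls.map(reversedHostKey).sort();

// Every host ending in "wikipedia.org" now shares one key prefix.
const hits = keys.filter((key) => key.startsWith("http://org.wikipedia."));
```

An index defined on the function's result (as PostgreSQL allows) would avoid storing the second column; without that feature the reversed key has to be stored redundantly, as the message says.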
On Fri, May 6, 2011 at 5:18 AM, Jonas Sicking jo...@sicking.cc wrote: Based on that, my conclusion is that we should go with what Pablo is proposing. And I think we should do it for v1. If I understand correctly, Pablo's proposal is that the author be allowed to specify a locale, and the browser can collate in some undefined way based on that locale. That sounds like a really bad idea for interop. If non-binary collation is supported in a first version, it should be either 1) Two choices, binary or UCA 6.0.0. (AFAIK, UCA gives fairly good results for most languages even without tailoring, so it might be just fine for v1. It's vastly better than binary, for sure.) 2) In addition to binary and UCA 6.0.0, allow UCA 6.0.0 tailored by any of the locales defined by CLDR 1.9.1. There also needs to be some thought put into how to handle version updates, since browsers cannot update their UCA or CLDR implementation without rebuilding all existing indexes that used it (unless they keep the old implementation
Re: [IndexedDB] Closing on bug 9903 (collations)
On Thu, May 5, 2011 at 2:12 AM, Keean Schupke ke...@fry-it.com wrote: What if the new version uses the same property name for a different thing? Yes, obviously it's going to be possible for code changes to cause hard-to-catch bugs due to not updating the database correctly. We don't have to add more cases where that's possible than necessary, without good reason. Maybe there's good reason here, but the added potential for error can't be neglected as a cost. Why would you need to read it. Every time you open the database you would need to check the function is the one you expect. Not if you never intend to change it, or don't care if it's outdated. I expect this to be the most common case. Consider the case of someone using CLDR-tailored UCA and a new version comes out. You want to use the newest version for new indexes, if multiple versions are available, but there's no pressing need to automatically update existing indexes. The old version is almost certainly good enough, unless your users use obscure languages. So in my scheme, you can just update the function in your code and do nothing else. In your scheme, you'd have to either stick to the old version across the board, or include both versions in your code indefinitely and include out-of-band logic to choose between them, or write a script that rebuilds the whole index on update (which would take a long time for a large index). The code would have to contain the function so it can compare it with the one in the DB and update it if necessary. If the code contains the function there are two copies of the function, one in the database and one in the code? which one is correct? which one is it using? So sometimes you will write the new function to the database, and sometimes you will not? More paths to test in code coverage, more complexity. Its simpler to just always set the function when opening the database. 
If the collation function is stored in the database, then I'd expect setting the function to rebuild the index if the new and old functions differ. This could happen as a background operation, with the existing index still usable (with the old collation function) in the meantime. So if you always wanted collations up-to-date, in my scheme authors could just set the function every time they open the database, as with your scheme. But this could trigger a silent rebuild whenever necessary, so the author doesn't have to worry about it. In your scheme, the author has to do the rebuild himself, and if he gets it wrong, the index will be corrupted. So as I see it, my approach is easier to use across the board. It lets you not update collations on old tables without requiring you to keep track of multiple collation function versions, and it also potentially lets you update collations on old tables to the latest versions with rebuilding done for you in the background. Critically, it does not let you change a sort function without rebuilding, since that will always cause bugs and you never want to do it (to a first approximation). Of course, maybe an initial implementation wouldn't do rebuilds for you, to keep it simple. Then the collation function would be immutable after index creation, so you'd still have to do rebuilds yourself. But it would still be easier and safer: the old index will still work in the interim even if you don't have the old version of your collation function around, and you can't mess up and get a corrupted index. Thinking about this a bit more. If you change the collation function you need to re-sort the index to make sure it will work (and avoid those strange bugs). Storing the function in the DB enables you to compare the function and only change it when you need to, thus optimising the number of re-sorts. 
That is the _only_ advantage to storing the function - as you still need to check the function stored is the one you expect to guarantee your code will run properly. So with a non-persisted function we need to sort every time we open to make sure the order is correct. And this is totally impractical for even moderately large datasets. I assume we want this to be usable for databases of, say, a gigabyte in size. You're not going to read, sort, and write a gigabyte on every database open. (My experience tends more toward multi-gigabyte databases or bigger, including writing code for Wikipedia, which is multi-terabyte. So maybe I'm biased to think about scalability more than necessary for IndexedDB, but re-sorting the index on every open still sounds really impractical to me.) However, if we attach a version number to the index, we can check the version number in our code to know if we need to re-sort the index. The simplest API for this would be: index.setCollation(1.1, my_collation_function); So the version number is checked against the index. If it is the same, the supplied collation function is used without re-sorting the index. If it is different the index order is
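The version-check proposal can be sketched as below. This is a guess at the intended behaviour, not a real API: `VersionedIndex` is a hypothetical in-memory stand-in, `setCollation` mirrors the call proposed in the message, and the branch for a changed version assumes that the index is re-sorted, since the quoted sentence is cut off before it says so.

```javascript
// Hypothetical in-memory stand-in for an index; not the IndexedDB API.
class VersionedIndex {
  constructor(values) {
    this.values = values.slice();
    this.collationVersion = null; // would be persisted with the index
    this.compare = null;          // never persisted, supplied on every open
    this.resortCount = 0;         // only here to make the behaviour visible
  }

  setCollation(version, compare) {
    this.compare = compare;
    if (version !== this.collationVersion) {
      // Assumption: a different version means the stored order is stale.
      this.values.sort(compare);
      this.collationVersion = version;
      this.resortCount += 1;
    }
    // Same version: trust the stored order and skip the re-sort.
  }
}

function binaryCollation(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

function caseInsensitiveCollation(a, b) {
  return binaryCollation(a.toLowerCase(), b.toLowerCase());
}

const index = new VersionedIndex(["b", "A", "c"]);
index.setCollation(1.0, binaryCollation);          // first open: sorts
index.setCollation(1.0, binaryCollation);          // same version: no sort
index.setCollation(1.1, caseInsensitiveCollation); // new version: sorts again
```

Nothing in this scheme stops an author from changing the function while leaving the version number alone, which is the silent-corruption risk raised earlier in the thread.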
Re: [IndexedDB] Closing on bug 9903 (collations)
On Wed, May 4, 2011 at 1:24 PM, Keean Schupke ke...@fry-it.com wrote: On 4 May 2011 21:01, Jonas Sicking jo...@sicking.cc wrote: On Wed, May 4, 2011 at 1:10 AM, Keean Schupke ke...@fry-it.com wrote: On 4 May 2011 00:57, Jonas Sicking jo...@sicking.cc wrote: On Tue, May 3, 2011 at 12:19 AM, Keean Schupke ke...@fry-it.com wrote: The more I think about it, the more I want a user-specified comparison function. Efficiency should not be an issue here - the engines should tweak the JIT compiler to fix any efficiency issues. Just let the user pass a closure (remember functions are first-class in JavaScript so this is not a callback nor an event). I don't think we should do callbacks for the first version of javascript. It gets very messy since we can't rely on the script function returning stable values. Garbage in = garbage out. The programmer's job is to write a correct comparison function. All functions have this problem. By this argument we had all better give up programming because there is a risk we may write a function that returns incorrect results. Browsers can certainly deal with this, and ensure that the only one suffering is the author of the buggy algorithm. However this comes at a cost in that the browser sorting algorithm can't go into infinite loops or crash even in the face of the most ridiculous comparison algorithm. In other words, the browser will likely have to use a slower sorting implementation in order to be robust. Additionally, there is a significant cost involved in transitioning between the C++ code implementing the sorting algorithm, and the JavaScript-implemented callback. That is on top of the cost of implementing the comparison function in JavaScript. Even in the best JITs, there is a significant overhead to both these parts. So rather than repeating myself, I'll just quote myself: So the choice here really is between only supporting some form of binary sorting, or supporting a built-in set of collations.
Anything else will have to wait for version 2 in my opinion. :) / Jonas I gave my answer, and some follow-up questions in a previous email, so I am not avoiding the question. My point was any event handler (onMouseDown?) could have an infinite loop - why so fussy about this one function when so many others have the same problem? The performance point of calling to JavaScript is a valid one, but is this a problem? Perhaps it is fast enough. I have seen no evidence that it will be too slow for people to use - perhaps the bottleneck will be the disk/flash access speed for fetching the blocks and not the JavaScript comparison function. I'm not worried about crashes or security issues, but I am worried about performance. Not only is it the overhead of crossing from C++ into JS, but also the fact that the C++ code has to go through extra pains to ensure that the world around it still makes sense by the time you come back from the JS callback. For example the callback could have deleted all IndexedDB databases and navigated to a new page. So every time you get back from JS you have to spend a bunch of time rechecking all the state that you were holding in your function implementation. All of this is totally doable. It's not even particularly hard. But it costs performance. / Jonas
Re: [IndexedDB] Closing on bug 9903 (collations)
On 3 May 2011 23:59, Aryeh Gregor simetrical+...@gmail.com wrote: On Tue, May 3, 2011 at 10:56 AM, Keean Schupke ke...@fry-it.com wrote: Why does it need to be persisted? I would prefer the database to be stateless. Obviously all users of the database need to use the same function. And if they don't use exactly the same function, maybe due to a transient bug, the index is silently and permanently corrupted, until all affected rows happen to be updated again? That doesn't sound like a good idea to me. I thought only the app that created the db could open it (for security reasons)... so it becomes the app's responsibility to do version control. The comparison function is not going to change by itself - someone has to go into the code and change it, when they do that they should up the revision of the database, if that change is incompatible. There is exactly the same problem with object properties. If the app changes to expect a new property on all objects stored, then the app has to correctly deal with the update. There are two issues here: 1) doing things correctly - there is no problem here, providing the closure works. 2) making things easy for the user - for me a simpler more predictable API is better for the user. Having a function stored inside the database is bad, because you cannot see what function might be stored in there... it might be a function from a previous version of the code and cause all sorts of strange bugs (which will only affect certain users with a certain version of the function stored in their DB). By having the sort function in plain sight in the source code it is visible and readable. Yes, there is a risk that the code is changed and the order method is different from that in the DB, which will cause breakage, but so can a function hidden in the database. Of the two I would always choose to have everything clearly visible in the source code where you can check it. Cheers, Keean.
Re: [IndexedDB] Closing on bug 9903 (collations)
On 4 May 2011 00:57, Jonas Sicking jo...@sicking.cc wrote: On Tue, May 3, 2011 at 12:19 AM, Keean Schupke ke...@fry-it.com wrote:

The more I think about it, the more I want a user-specified comparison function. Efficiency should not be an issue here - the engines should tweak the JIT compiler to fix any efficiency issues. Just let the user pass a closure (remember functions are first-class in JavaScript so this is not a callback nor an event).

I don't think we should do callbacks for the first version of javascript. It gets very messy since we can't rely on the script function returning stable values. Additionally we'd either have to ask that the callback function is re-registered each time the database is opened, or somehow store a serialized copy of the callback function in the browser so that it's available the next time the database is opened. Neither of these things has been done in other APIs in the past, so if we hold up v1 until we solve the challenges involved I think it will delay the release of a stable spec. So the choice here really is between only supporting some form of binary sorting, or supporting a built-in set of collations. Anything else will have to wait for version 2 in my opinion. / Jonas

That's fine with me, providing the other issues around collation orders are solved. If something like the Unicode algorithm is used (and if not I would want to be convinced there is a good reason for doing something different than everyone else) there is the issue of what orderings are provided by everyone (maybe DUCET + current CLDR). Then there is how often the CLDR should be updated. Should there be a live fetch / version check every time the DB is started (seems like a sensible route to me, where possible)? Otherwise the CLDR version could be specified by the standard and updated with each version of the standard. Cheers, Keean.
Re: [IndexedDB] Closing on bug 9903 (collations)
On 4 May 2011 00:57, Jonas Sicking jo...@sicking.cc wrote: On Tue, May 3, 2011 at 12:19 AM, Keean Schupke ke...@fry-it.com wrote: The more I think about it, the more I want a user-specified comparison function. Efficiency should not be an issue here - the engines should tweak the JIT compiler to fix any efficiency issues. Just let the user pass a closure (remember functions are first-class in JavaScript so this is not a callback nor an event). I don't think we should do callbacks for the first version of javascript. It gets very messy since we can't rely on that the script function will be returning stable values. Garbage in = garbage out. The programmer's job is to write a correct comparison function. All functions have this problem. By this argument we had all better give up programming because there is a risk we may write a function that returns incorrect results. Additionally we'd either have to ask that the callback function is re-registered each time the database is opened, or somehow store a I still think re-registering is a non-issue. It is trivial to declare a local open function openNameIndex that calls openIndex with the correct callback and provide that as a software module - either in the main code, or in a separate JS file that can be included in each page. Modular programming is a good thing, should be encouraged, and is the traditional software engineering solution to this kind of problem. serialized copy of the callback function in the browser so that it's available the next time the database is opened. Neither of these things have been done in other APIs in the past, so if we hold up v1 until we solve the challenges involved I think it will delay the release of a stable spec. So the choice here really is between only supporting some form of binary sorting, or supporting a built-in set of collations. Anything else will have to wait for version 2 in my opinion. / Jonas Cheers, Keean.
Re: [IndexedDB] Closing on bug 9903 (collations)
On Wed, May 4, 2011 at 1:10 AM, Keean Schupke ke...@fry-it.com wrote: On 4 May 2011 00:57, Jonas Sicking jo...@sicking.cc wrote: On Tue, May 3, 2011 at 12:19 AM, Keean Schupke ke...@fry-it.com wrote: The more I think about it, the more I want a user-specified comparison function. Efficiency should not be an issue here - the engines should tweak the JIT compiler to fix any efficiency issues. Just let the user pass a closure (remember functions are first-class in JavaScript so this is not a callback nor an event). I don't think we should do callbacks for the first version of javascript. It gets very messy since we can't rely on that the script function will be returning stable values. Garbage in = garbage out. The programmer's job is to write a correct comparison function. All functions have this problem. By this argument we had all better give up programming because there is a risk we may write a function that returns incorrect results. Browsers can certainly deal with this, and ensure that the only one suffering is the author of the buggy algorithm. However this comes at a cost in that the browser sorting algorithm can't go into infinite loops or crash even in the face of the most ridiculous comparison algorithm. In other words, the browser will likely have to use a slower sorting implementation in order to be robust. Additionally, there is a significant cost involved in transitioning between the C++ code implementing the sorting algorithm, and the javascript implemented callback. That is on top of the cost of implementing the comparison function in javascript. Even in the best JITs, there is a significant overhead to both these parts. So rather than repeating myself, I'll just quote myself: So the choice here really is between only supporting some form of binary sorting, or supporting a built-in set of collations. Anything else will have to wait for version 2 in my opinion. :) / Jonas
Re: [IndexedDB] Closing on bug 9903 (collations)
On 4 May 2011 21:01, Jonas Sicking jo...@sicking.cc wrote: On Wed, May 4, 2011 at 1:10 AM, Keean Schupke ke...@fry-it.com wrote: On 4 May 2011 00:57, Jonas Sicking jo...@sicking.cc wrote: On Tue, May 3, 2011 at 12:19 AM, Keean Schupke ke...@fry-it.com wrote: The more I think about it, the more I want a user-specified comparison function. Efficiency should not be an issue here - the engines should tweak the JIT compiler to fix any efficiency issues. Just let the user pass a closure (remember functions are first-class in JavaScript so this is not a callback nor an event). I don't think we should do callbacks for the first version of javascript. It gets very messy since we can't rely on that the script function will be returning stable values. Garbage in = garbage out. The programmer's job is to write a correct comparison function. All functions have this problem. By this argument we had all better give up programming because there is a risk we may write a function that returns incorrect results. Browsers can certainly deal with this, and ensure that the only one suffering is the author of the buggy algorithm. However this comes at a cost in that the browser sorting algorithm can't go into infinite loops or crash even in the face of the most ridiculous comparison algorithm. In other words, the browser will likely have to use a slower sorting implementation in order to be robust. Additionally, there is a significant cost involved in transitioning between the C++ code implementing the sorting algorithm, and the javascript implemented callback. That is on top of the cost of implementing the comparison function in javascript. Even in the best JITs, there is a significant overhead to both these parts. So rather than repeating myself, I'll just quote myself: So the choice here really is between only supporting some form of binary sorting, or supporting a built-in set of collations. Anything else will have to wait for version 2 in my opinion. 
:) / Jonas I gave my answer, and some follow-up questions in a previous email, so I am not avoiding the question. My point was any event handler (onMouseDown?) could have an infinite loop - why so fussy about this one function when so many others have the same problem? The performance point of calling to JavaScript is a valid one, but is this a problem? Perhaps it is fast enough. I have seen no evidence that it will be too slow for people to use - perhaps the bottleneck will be the disk/flash access speed for fetching the blocks and not the JavaScript comparison function. Cheers, Keean.
Re: [IndexedDB] Closing on bug 9903 (collations)
On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking jo...@sicking.cc wrote: I don't think we should do callbacks for the first version of javascript. It gets very messy since we can't rely on that the script function will be returning stable values. The worst that would happen if it didn't return stable values is that sorting would return unpredictable results. So the choice here really is between only supporting some form of binary sorting, or supporting a built-in set of collations. Anything else will have to wait for version 2 in my opinion. I think it would be a mistake to try supporting a limited set of natural-language collations. Binary collation is fine for a first version. MySQL only supported binary collation up through version 4, for instance. On Wed, May 4, 2011 at 3:49 AM, Keean Schupke ke...@fry-it.com wrote: I thought only the app that created the db could open it (for security reasons)... so it becomes the app's responsibility to do version control. The comparison function is not going to change by itself - someone has to go into the code and change it, when they do that they should up the revision of the database, if that change is incompatible. Why should we let such a pitfall exist if we can just store the function and avoid the issue? There is exactly the same problem with object properties. If the app changes to expect a new property on all objects stored, then the app has to correctly deal with the update. If a requested property doesn't exist, I assume the API will fail immediately with a clear error code. It will not fail silently and mysteriously with no error code. (Again, I haven't looked at it closely, or tried to use it.) 2) making things easy for the user - for me a simpler more predictable API is better for the user. Having a function stored inside the database is bad, because you cannot see what function might be stored in there... We could let you query the stored function. 
it might be a function from a previous version of the code and cause all sorts of strange bugs (which will only affect certain users with a certain version of the function stored in their DB). It will cause *much* less strange bugs than if you have one index that used two different collations, which is the alternative possibility. If the function is stored, the worst case will be that the collation function is out of date. In practice, authors will mostly want to use established collation functions like UCA and won't mind if they're out of date. They'll also only very rarely have occasion to deliberately change the function. On Wed, May 4, 2011 at 4:01 PM, Jonas Sicking jo...@sicking.cc wrote: Browsers can certainly deal with this, and ensure that the only one suffering is the author of the buggy algorithm. However this comes at a cost in that the browser sorting algorithm can't go into infinite loops or crash even in the face of the most ridiculous comparison algorithm. In other words, the browser will likely have to use a slower sorting implementation in order to be robust. The browser will only run the function once every time the given field changes, and change the value used in the index if it's different from the current one. The actual sorting will still be binary, just with a user-provided key. So there's no possibility of especially bad effects if you're given a bad function. You're only running it once per value, so it's no worse than any other function that's run a bunch of times. We aren't talking about a sort()-style comparison function that returns -1 or 0 or 1. We're talking about a function that takes a string as input, and outputs a string to be used in the index as the key for the object in question. I guess you *could* also do it as a comparison function too -- would probably be easier to write, but also a lot easier to get badly wrong, and you'd have to do a bunch of function calls on insert or update instead of just one. 
Additionally, there is a significant cost involved in transitioning between the C++ code implementing the sorting algorithm, and the javascript implemented callback. That is on top of the cost of implementing the comparison function in javascript. Even in the best JITs, there is a significant overhead to both these parts. It would only have to be run once per row (object?) modified. Not run at all for reads. Would that really be so bad? Also, most authors would be content with built-in CLDR-based sort functions, which could be C++.
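The key-function idea described in this message can be sketched as follows. This is only an illustration under stated assumptions: `sortKey`, `keyCalls`, `people` and the in-memory `keyedIndex` are hypothetical names, not part of any IndexedDB draft, and lower-casing stands in for a real collation key such as a UCA sortkey. The function runs once per stored value, and ordering is then a plain binary comparison of the derived keys.

```javascript
// Hypothetical key function: maps a value to the string stored in the index.
// Lower-casing is a stand-in for a real sortkey algorithm.
let keyCalls = 0;
function sortKey(name) {
  keyCalls += 1;
  return name.toLowerCase();
}

// Plain binary (code unit) comparison of two strings.
function binaryCompare(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

const people = [{ name: "cherry" }, { name: "Banana" }, { name: "apple" }];

// On write: compute the key once per row and store it next to the row.
const keyedIndex = people.map((row) => ({ key: sortKey(row.name), row }));

// Ordering and reads only ever compare the stored keys; sortKey is not called again.
keyedIndex.sort((a, b) => binaryCompare(a.key, b.key));
const orderedNames = keyedIndex.map((entry) => entry.row.name);
```

With three rows the key function is called three times in total, however many comparisons the sort performs, which is the cost argument made above.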
Re: [IndexedDB] Closing on bug 9903 (collations)
The more I think about it, the more I want a user-specified comparison function. Efficiency should not be an issue here - the engines should tweak the JIT compiler to fix any efficiency issues. Just let the user pass a closure (remember functions are first-class in JavaScript so this is not a callback nor an event). Keean. On 2 May 2011 19:57, Aryeh Gregor simetrical+...@gmail.com wrote: On Fri, Apr 29, 2011 at 3:19 PM, Keean Schupke ke...@fry-it.com wrote: As long as we have a binary mode I am happy. Something I didn't think to mention: what exactly is binary mode for DOMStrings? I guess it means you encode as big-endian UTF-16, then sort bytewise? This is kind of evil, but it matches what sort() does, so I guess it should be the required behavior. (It's kind of evil because it doesn't match code-point order, unlike if you encoded as UTF-8. E.g., U+10000 is encoded as 0xd800dc00 and U+E000 is 0xe000, so U+E000 sorts after U+10000.) Perhaps this should be spelled out more clearly in the spec.
Re: [IndexedDB] Closing on bug 9903 (collations)
On Tue, May 3, 2011 at 3:19 AM, Keean Schupke ke...@fry-it.com wrote: The more I think about it, the more I want a user-specified comparison function. Efficiency should not be an issue here - the engines should tweak the JIT compiler to fix any efficiency issues. Just let the user pass a closure (remember functions are first-class in JavaScript so this is not a callback nor an event). Wouldn't it be a bit more complicated than just passing a regular closure? The function has to be persisted in the database across page views, but a JavaScript closure is going to contain references to all sorts of objects (like document, or local variables) that are very specific to the current page view. It makes no sense to persist those objects in general. You'd need to serialize the function somehow, possibly putting restrictions on the sorts of variables it can access, so that it can be sensibly restored later. Is there some established way of doing this yet in JavaScript? It might be useful in other contexts too. I still agree that this is the correct direction to go in, though.
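The serialization problem raised here can be seen with a small experiment. The sketch below is not a proposed mechanism; `makeKeyFn` and `capturedPrefix` are hypothetical names. It serializes a closure with `Function.prototype.toString` and rebuilds it with the `Function` constructor: the source text survives, but the captured variable does not.

```javascript
// A closure that captures a local variable from its defining scope.
function makeKeyFn(capturedPrefix) {
  return function (s) {
    return capturedPrefix + s.toLowerCase();
  };
}

const keyFn = makeKeyFn("en:");
const liveResult = keyFn("Abc"); // works: the closure still sees capturedPrefix

// "Persist" the function as source text, then restore it later.
const persistedSource = keyFn.toString();
const restored = new Function("return (" + persistedSource + ")")();

// The restored function is created in the global scope, so the captured
// variable is gone and the call fails.
let restoreError = null;
try {
  restored("Abc");
} catch (e) {
  restoreError = e;
}
```

Any scheme that stores the function would therefore need to restrict it to code with no free variables, or define what environment it is restored into.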
Re: [IndexedDB] Closing on bug 9903 (collations)
Why does it need to be persisted? I would prefer the database to be stateless. Obviously all users of the database need to use the same function. I would recommend modular programming - create a .js script you can include in all pages that provides 'collated' versions of the method calls by adding the collation argument - In fact, for good programming in general, make this API your model, so if you were writing a shopping cart, this '.js' would provide methods like 'addToCart', 'removeFromCart', and all collation settings would be hidden in this layer and kept out of individual pages, whilst not needing to be stored in the database at all. Cheers, Keean. On 3 May 2011 15:27, Aryeh Gregor simetrical+...@gmail.com wrote: On Tue, May 3, 2011 at 3:19 AM, Keean Schupke ke...@fry-it.com wrote: The more I think about it, the more I want a user-specified comparison function. Efficiency should not be an issue here - the engines should tweak the JIT compiler to fix any efficiency issues. Just let the user pass a closure (remember functions are first-class in JavaScript so this is not a callback nor an event). Wouldn't it be a bit more complicated than just passing a regular closure? The function has to be persisted in the database across page views, but a JavaScript closure is going to contain references to all sorts of objects (like document, or local variables) that are very specific to the current page view. It makes no sense to persist those objects in general. You'd need to serialize the function somehow, possibly putting restrictions on the sorts of variables it can access, so that it can be sensibly restored later. Is there some established way of doing this yet in JavaScript? It might be useful in other contexts too. I still agree that this is the correct direction to go in, though.
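A minimal sketch of the module layer suggested above, assuming a store that accepts a caller-supplied comparison function. No such IndexedDB call exists in the draft; `openIndex` here is an in-memory stand-in, and `nameCompare` and `openNameIndex` are hypothetical names. The point is only that pages call `openNameIndex()` and never see the collation.

```javascript
// names.js (hypothetical shared module): the single place that defines the collation.
function nameCompare(a, b) {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  return x < y ? -1 : x > y ? 1 : 0;
}

// In-memory stand-in for a store opened with a comparison function.
function openIndex(compare) {
  const keys = [];
  return {
    add(key) {
      keys.push(key);
      keys.sort(compare); // keep keys ordered by the supplied comparison
    },
    list() {
      return keys.slice();
    },
  };
}

// Pages use this wrapper, so the comparison function is never passed by hand.
function openNameIndex() {
  return openIndex(nameCompare);
}

const names = openNameIndex();
names.add("cherry");
names.add("apple");
names.add("Banana");
const listedNames = names.list();
```

This keeps the function visible in source, as argued above, but it relies on every page loading the same version of the module.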
Re: [IndexedDB] Closing on bug 9903 (collations)
On Tue, May 3, 2011 at 10:56 AM, Keean Schupke ke...@fry-it.com wrote: Why does it need to be persisted? I would prefer the database to be stateless. Obviously all users of the database need to use the same function. And if they don't use exactly the same function, maybe due to a transient bug, the index is silently and permanently corrupted, until all affected rows happen to be updated again? That doesn't sound like a good idea to me.
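The silent-corruption risk can be illustrated with a sorted array standing in for an index. This is a sketch, not how any browser stores data; `lowerBound`, `insertKey`, `hasKey` and the two comparators are hypothetical. Keys are inserted under one comparison function and then looked up under another.

```javascript
// Binary search: first position whose element is not less than key under cmp.
function lowerBound(arr, key, cmp) {
  let lo = 0;
  let hi = arr.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (cmp(arr[mid], key) < 0) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

function insertKey(arr, key, cmp) {
  arr.splice(lowerBound(arr, key, cmp), 0, key);
}

function hasKey(arr, key, cmp) {
  const i = lowerBound(arr, key, cmp);
  return i < arr.length && cmp(arr[i], key) === 0;
}

const binaryCmp = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
const caseInsensitiveCmp = (a, b) => binaryCmp(a.toLowerCase(), b.toLowerCase());

// Build the index with one comparison function...
const sortedKeys = [];
for (const k of ["apple", "Banana", "cherry", "Date"]) {
  insertKey(sortedKeys, k, caseInsensitiveCmp);
}

// ...then look a key up with the same function, and with a different one.
const foundWithSameFn = hasKey(sortedKeys, "Banana", caseInsensitiveCmp); // true
const foundWithOtherFn = hasKey(sortedKeys, "Banana", binaryCmp); // false
```

The second lookup reports the key as missing even though it is stored, and nothing signals an error.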
Re: [IndexedDB] Closing on bug 9903 (collations)
On Sunday, 1 May 2011, Aryeh Gregor simetrical+...@gmail.com wrote: On Fri, Apr 29, 2011 at 3:32 PM, Jonas Sicking jo...@sicking.cc wrote: I agree that we will eventually want to standardize the set of allowed collations. Similarly to how we'll want to standardize on one set of charset encodings supported. However I don't think we, in this spec community, have enough experience to come up with a good such set. So it's something that I think we should postpone for now. As I understand it there is work going on in this area in other groups, so hopefully we can lean on that work eventually. (Disclaimer: I never really tried to figure out how IndexedDB works, and I haven't seen the past discussion on this topic. However, I know a decent amount about database collations in practice from my work with MediaWiki, which included adding collation support to category pages last summer on a contract with Wikimedia. Maybe everything I'm saying has already been brought up before and/or everyone knows it and/or it's wrong, in which case I apologize in advance.) The Unicode Collation Algorithm is the standard here: http://www.unicode.org/reports/tr10/ It's pretty stable (I think), and out of the box it provides *vastly* better sorting than binary sort. Binary sort doesn't even work for English unless you normalize case and avoid punctuation marks, and it's basically useless for most non-English languages. Some type of UCA support in browsers would be the way to go here. UCA doesn't work perfectly for all locales, though, because different locales sort the same strings differently (French handling of accents, etc.). The standard database of locale-specific collations is CLDR: http://cldr.unicode.org/ CLDR tends to have several new releases per year. For instance, 1.9.1 was released this March, three versions were released last year, and five were released in 2009. Just looking at the release notes, it seems that most if not all of these releases update collation details. 
Because of how collations are actually used in databases, any change to the collation version will require rebuilding any index that uses that collation. I don't think it's a good idea for browsers to try packaging such rapidly-changing locale data. If everyone had Chrome's release and support schedule, it might work okay -- if you figured out a way to handle updates gracefully -- but in practice, authors deal with a wide range of browser ages. It's not good if every user has a different implementation of each collation. Nor if browsers just use a frozen and obsolescent collation version. I also don't know how realistic implementers would find it to ship collation support for every language CLDR supports -- the CLDR download is a few megabytes zipped, but I don't know how much of that browsers would need to ship to support all its tailorings. The general solution here would be to allow the creation of indexes based on a user-supplied function. I.e., the user-supplied function would (in SQL terms) take the row's data as input, and output some binary string. That string would be used as the key in the index, instead of any of the column values for the row. PostgreSQL allows this, or so I've heard. Then you could implement UCA (optionally with CLDR tailorings) or any other collation algorithm you liked in JavaScript. Of course, we can't expect authors to reimplement the UCA if they want to get decent sorting. It would make sense for browsers to expose some default sort functions, but I'm not familiar enough with UCA or CLDR to say which ones would be best in practice. It might make sense to expose some medium-level primitives that would allow authors to easily overlay tailoring on the basic UCA algorithm, or something. Or maybe it would really make sense to expose all of CLDR's tailored collations. I'm not familiar enough with the specs to say. But for the sake of flexibility, allowing indexes based on user-defined functions is the way to go. 
(They're useful for things other than collations, too.) The proposed ECMAScript LocaleInfo.Collator looks like it doesn't currently support this use-case, since it provides only sort functions and not sortkey generation functions: http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api If browsers do provide sortkey generation functions based on UCA, some versioning mechanism will need to be used, particularly if it supports tailored sortkeys. FWIW, MySQL provides some built-in collation support, but MediaWiki doesn't use it, because it supports too few languages and is too inflexible. MediaWiki's stock localization has 99% support for the 500 most-used messages in 175 different languages, and the couple dozen locales that MySQL supports aren't acceptable for us. Instead, we store everything with a binary collation, and are moving to a system where we compute the UCA sortkeys ourselves and put them in
Re: [IndexedDB] Closing on bug 9903 (collations)
On Fri, Apr 29, 2011 at 3:19 PM, Keean Schupke ke...@fry-it.com wrote: As long as we have a binary mode I am happy. Something I didn't think to mention: what exactly is binary mode for DOMStrings? I guess it means you encode as big-endian UTF-16, then sort bytewise? This is kind of evil, but it matches what sort() does, so I guess it should be the required behavior. (It's kind of evil because it doesn't match code-point order, unlike if you encoded as UTF-8. E.g., U+10000 is encoded as 0xd800dc00 and U+E000 is 0xe000, so U+E000 sorts after U+10000.) Perhaps this should be spelled out more clearly in the spec.
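The difference between code unit order and code point order can be checked directly in JavaScript. The variable names below are illustrative only; the two characters are the ones from the example, U+10000 (stored as the surrogate pair 0xD800 0xDC00) and U+E000.

```javascript
const supplementary = "\u{10000}"; // two UTF-16 code units: 0xD800 0xDC00
const privateUse = "\uE000";       // one UTF-16 code unit: 0xE000

// String comparison in JavaScript works on code units: 0xD800 < 0xE000.
const lessByCodeUnit = supplementary < privateUse; // true

// Comparing the code points themselves gives the opposite answer: 0x10000 > 0xE000.
const lessByCodePoint = supplementary.codePointAt(0) < privateUse.codePointAt(0); // false

// The default Array.prototype.sort uses the same code unit ordering.
const sortedPair = [privateUse, supplementary].sort();
const supplementarySortsFirst = sortedPair[0] === supplementary; // true
```

So a "binary" mode defined over UTF-16 code units agrees with sort(), but not with the order a UTF-8 byte comparison would give.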
Re: [IndexedDB] Closing on bug 9903 (collations)
On Fri, Apr 29, 2011 at 3:32 PM, Jonas Sicking jo...@sicking.cc wrote: I agree that we will eventually want to standardize the set of allowed collations. Similarly to how we'll want to standardize on one set of charset encodings supported. However I don't think we, in this spec community, have enough experience to come up with a good such set. So it's something that I think we should postpone for now. As I understand it there is work going on in this area in other groups, so hopefully we can lean on that work eventually. (Disclaimer: I never really tried to figure out how IndexedDB works, and I haven't seen the past discussion on this topic. However, I know a decent amount about database collations in practice from my work with MediaWiki, which included adding collation support to category pages last summer on a contract with Wikimedia. Maybe everything I'm saying has already been brought up before and/or everyone knows it and/or it's wrong, in which case I apologize in advance.) The Unicode Collation Algorithm is the standard here: http://www.unicode.org/reports/tr10/ It's pretty stable (I think), and out of the box it provides *vastly* better sorting than binary sort. Binary sort doesn't even work for English unless you normalize case and avoid punctuation marks, and it's basically useless for most non-English languages. Some type of UCA support in browsers would be the way to go here. UCA doesn't work perfectly for all locales, though, because different locales sort the same strings differently (French handling of accents, etc.). The standard database of locale-specific collations is CLDR: http://cldr.unicode.org/ CLDR tends to have several new releases per year. For instance, 1.9.1 was released this March, three versions were released last year, and five were released in 2009. Just looking at the release notes, it seems that most if not all of these releases update collation details. 
Because of how collations are actually used in databases, any change to the collation version will require rebuilding any index that uses that collation. I don't think it's a good idea for browsers to try packaging such rapidly-changing locale data. If everyone had Chrome's release and support schedule, it might work okay -- if you figured out a way to handle updates gracefully -- but in practice, authors deal with a wide range of browser ages. It's not good if every user has a different implementation of each collation. Nor if browsers just use a frozen and obsolescent collation version. I also don't know how realistic implementers would find it to ship collation support for every language CLDR supports -- the CLDR download is a few megabytes zipped, but I don't know how much of that browsers would need to ship to support all its tailorings. The general solution here would be to allow the creation of indexes based on a user-supplied function. I.e., the user-supplied function would (in SQL terms) take the row's data as input, and output some binary string. That string would be used as the key in the index, instead of any of the column values for the row. PostgreSQL allows this, or so I've heard. Then you could implement UCA (optionally with CLDR tailorings) or any other collation algorithm you liked in JavaScript. Of course, we can't expect authors to reimplement the UCA if they want to get decent sorting. It would make sense for browsers to expose some default sort functions, but I'm not familiar enough with UCA or CLDR to say which ones would be best in practice. It might make sense to expose some medium-level primitives that would allow authors to easily overlay tailoring on the basic UCA algorithm, or something. Or maybe it would really make sense to expose all of CLDR's tailored collations. I'm not familiar enough with the specs to say. But for the sake of flexibility, allowing indexes based on user-defined functions is the way to go. 
(They're useful for things other than collations, too.) The proposed ECMAScript LocaleInfo.Collator looks like it doesn't currently support this use-case, since it provides only sort functions and not sortkey generation functions: http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api If browsers do provide sortkey generation functions based on UCA, some versioning mechanism will need to be used, particularly if it supports tailored sortkeys. FWIW, MySQL provides some built-in collation support, but MediaWiki doesn't use it, because it supports too few languages and is too inflexible. MediaWiki's stock localization has 99% support for the 500 most-used messages in 175 different languages, and the couple dozen locales that MySQL supports aren't acceptable for us. Instead, we store everything with a binary collation, and are moving to a system where we compute the UCA sortkeys ourselves and put them in their own column, which we use for sorting. MediaWiki's i18n people can be reached in #mediawiki-i18n on freenode or the Mediawiki-i18n list
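The claim that binary sort fails even for English unless case is normalized is easy to reproduce. The sketch below compares the default sort with `Intl.Collator`, the comparison API that was later standardized in the ECMAScript Internationalization API; it did not exist when this thread was written, and like the LocaleInfo.Collator strawman it exposes a compare function rather than sortkey generation. The word list is made up for illustration.

```javascript
const words = ["cherry", "apple", "Banana"];

// Binary (code unit) order: every uppercase ASCII letter sorts before every lowercase one.
const binarySorted = [...words].sort();

// Locale-aware comparison for English.
const collator = new Intl.Collator("en");
const collatedSorted = [...words].sort(collator.compare);
```

Here `binarySorted` is Banana, apple, cherry, while `collatedSorted` is apple, Banana, cherry. A comparison function like this can order an array in memory, but it does not by itself yield a key that could be stored in a binary-ordered index, which is the gap noted above.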
Re: [IndexedDB] Closing on bug 9903 (collations)
On Fri, Apr 29, 2011 at 11:16 AM, Pablo Castro pablo.cas...@microsoft.com wrote: We've had quite a bit of debate on this but I don't think we've reached closure. At this point I would be fine with either one of a) postpone to v2 and agree that for now we'll just do binary collation everywhere or b) the last form of the proposal sent around: extra collation argument (following BCP47 plus whatever the UA wants to allow) in createObjectStore/createIndex, plus a collation property to interrogate it; no way to change the collation of a store/index once created. Given that this turned out to be a more elaborate topic than I had originally expected and that it doesn't seem to have a lot of traction right now, my preference would be to postpone to v2. Thoughts? Once we make a call I'll make sure the spec reflects it. I'd be fine with postponing it. However I don't think that the counter proposals that we've received will work, so I don't think that there is a reason to postpone. / Jonas
Re: [IndexedDB] Closing on bug 9903 (collations)
On Friday, 29 April 2011, Jonas Sicking jo...@sicking.cc wrote: On Fri, Apr 29, 2011 at 11:16 AM, Pablo Castro pablo.cas...@microsoft.com wrote: We've had quite a bit of debate on this but I don't think we've reached closure. At this point I would be fine with either one of a) postpone to v2 and agree that for now we'll just do binary collation everywhere or b) the last form of the proposal sent around: extra collation argument (following BCP47 plus whatever the UA wants to allow) in createObjectStore/createIndex, plus a collation property to interrogate it; no way to change the collation of a store/index once created. Given that this turned out to be a more elaborate topic than I had originally expected and that it doesn't seem to have a lot of traction right now, my preference would be to postpone to v2. Thoughts? Once we make a call I'll make sure the spec reflects it. I'd be fine with postponing it. However I don't think that the counter proposals that we've received will work, so I don't think that there is a reason to postpone. / Jonas As long as we have a binary mode I am happy. If it is to support other collations, then all browsers must support the same set of options. The question then becomes what set of collation modes to standardise on? Allowing non standard collations will result in apps that will only run correctly on one browser, and that does not seem a good idea to me. Cheers, Keean.
Re: [IndexedDB] Closing on bug 9903 (collations)
On Fri, Apr 29, 2011 at 12:19 PM, Keean Schupke ke...@fry-it.com wrote: On Friday, 29 April 2011, Jonas Sicking jo...@sicking.cc wrote: On Fri, Apr 29, 2011 at 11:16 AM, Pablo Castro pablo.cas...@microsoft.com wrote: We've had quite a bit of debate on this but I don't think we've reached closure. At this point I would be fine with either one of a) postpone to v2 and agree that for now we'll just do binary collation everywhere or b) the last form of the proposal sent around: extra collation argument (following BCP47 plus whatever the UA wants to allow) in createObjectStore/createIndex, plus a collation property to interrogate it; no way to change the collation of a store/index once created. Given that this turned out to be a more elaborate topic than I had originally expected and that it doesn't seem to have a lot of traction right now, my preference would be to postpone to v2. Thoughts? Once we make a call I'll make sure the spec reflects it. I'd be fine with postponing it. However I don't think that the counter proposals that we've received will work, so I don't think that there is a reason to postpone. / Jonas As long as we have a binary mode I am happy. If it is to support other collations, then all browsers must support the same set of options. The question then becomes what set of collation modes to standardise on? Allowing non standard collations will result in apps that will only run correctly on one browser, and that does not seem a good idea to me. I agree that we will eventually want to standardize the set of allowed collations. Similarly to how we'll want to standardize on one set of charset encodings supported. However I don't think we, in this spec community, have enough experience to come up with a good such set. So it's something that I think we should postpone for now. As I understand it there is work going on in this area in other groups, so hopefully we can lean on that work eventually. 
Of course, we still do need to have a standardized vocabulary for the collations though. / Jonas
Re: [IndexedDB] Closing on bug 9903 (collations)
There is always something like UCA: http://www.unicode.org/reports/tr10/ which looks interesting. Cheers, Keean. On 29 April 2011 20:32, Jonas Sicking jo...@sicking.cc wrote: On Fri, Apr 29, 2011 at 12:19 PM, Keean Schupke ke...@fry-it.com wrote: On Friday, 29 April 2011, Jonas Sicking jo...@sicking.cc wrote: On Fri, Apr 29, 2011 at 11:16 AM, Pablo Castro pablo.cas...@microsoft.com wrote: We've had quite a bit of debate on this but I don't think we've reached closure. At this point I would be fine with either one of a) postpone to v2 and agree that for now we'll just do binary collation everywhere or b) the last form of the proposal sent around: extra collation argument (following BCP47 plus whatever the UA wants to allow) in createObjectStore/createIndex, plus a collation property to interrogate it; no way to change the collation of a store/index once created. Given that this turned out to be a more elaborate topic than I had originally expected and that it doesn't seem to have a lot of traction right now, my preference would be to postpone to v2. Thoughts? Once we make a call I'll make sure the spec reflects it. I'd be fine with postponing it. However I don't think that the counter proposals that we've received will work, so I don't think that there is a reason to postpone. / Jonas As long as we have a binary mode I am happy. If it is to support other collations, then all browsers must support the same set of options. The question then becomes what set of collation modes to standardise on? Allowing non standard collations will result in apps that will only run correctly on one browser, and that does not seem a good idea to me. I agree that we will eventually want to standardize the set of allowed collations. Similarly to how we'll want to standardize on one set of charset encodings supported. However I don't think we, in this spec community, have enough experience to come up with a good such set. So it's something that I think we should postpone for now. 
As I understand it there is work going on in this area in other groups, so hopefully we can lean on that work eventually. Of course, we still do need to have a standardized vocabulary for the collations though. / Jonas