-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Lars Kristan
Subject: RE: Roundtripping Solved


> However, requirements 1 and 2 are actually taken from Unicode standard, they are not my requirements.
> How's that? Well, they are my requirements also, but instead of "for all valid UTF-x strings", in my case the requirement is relaxed to "for all valid UTF-8 strings that do not contain the 128 replacement codepoints".

Yes, I follow that. But if you replace the phrase "128 replacement codepoints" with the phrase "128 replacement codepoint strings", or "128 replacement escape sequences" then you do actually still have a workable scheme which does the job just as well. You don't seem to have acknowledged this, but think it through.


I know you argued against replacement strings a while back for "performance reasons". I should have replied to that at the time, but I let it go. Realistically, Lars, I think you should just take the performance hit. The computing cost of counting characters in a null-terminated UTF-8 string is really not much more than the cost of strlen(). Think about it - all you have to do is disregard bytes matching the bit pattern 10xxxxxx (the continuation bytes) and count all the rest. You're talking about adding a couple of machine code instructions to the loop, that's all.

Not only that: as a programmer, you /must/ surely realise that the performance cost of even the most complex UTF conversion is going to be utterly insignificant compared with the time it takes to move the drive head from one part of a hard disc to another. Your conversions will be totally swamped by all the snail-pace fstat()s etc. that you'll need to do to get your filenames in the first place. And even if you don't accept that, I hope you can understand that if it is suggested to the UTC that they reserve some codepoints just so you don't have to take a performance hit, the proposal won't get much past their inbox.
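To make the "couple of extra instructions" point concrete, here is a minimal sketch in Python (the function name is mine, not from any library) of counting code points in a UTF-8 byte string by skipping continuation bytes:

```python
def utf8_codepoint_count(data: bytes) -> int:
    """Count code points in a UTF-8 byte string.

    Continuation bytes match the bit pattern 10xxxxxx, i.e.
    (b & 0xC0) == 0x80; every other byte starts a code point,
    so counting the rest gives the character count.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


s = "naïve café".encode("utf-8")
# len(s) is what strlen() would report (12 bytes);
# utf8_codepoint_count(s) is the character count (10).
```

The inner test is one mask and one compare per byte, which is the whole of the overhead relative to a plain byte-counting loop.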

So let's hypothesise that you /can/ take the performance hit. In that case, escape sequences will work just as well as reserved characters. They will fulfil exactly the same function ... EXCEPT that you no longer have to worry that Unicode text might contain one of the single codepoints by accident. Instead, you have a relaxed requirement - that Unicode text should not contain any escape strings by accident ... and that can be arranged with an utterly astronomical degree of certainty (though never /absolute/ certainty, of course). I submit, therefore, again, that all of your needs will be met (possibly apart from the "no performance hit" thing) if you accept strings of characters instead of single characters. /This is workable/.
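A sketch of what such an escape-string scheme might look like, in Python. Everything here is illustrative: the marker character, the hex format, and the function names are all my own choices, and a real deployment would pick a longer marker string to get the "astronomical certainty" that it never occurs in genuine text. (Python's own `surrogateescape` error handler is used only as a convenient way to smuggle the undecodable bytes out of the decoder.)

```python
MARK = "\u241B"  # U+241B SYMBOL FOR ESCAPE - an arbitrary, hypothetical choice


def decode_with_escapes(data: bytes) -> str:
    """Decode UTF-8, turning each undecodable byte into the escape
    string MARK + two hex digits, instead of a single code point."""
    # surrogateescape maps each bad byte 0xXX to the lone surrogate
    # U+DCXX; we then rewrite each surrogate as a multi-character
    # escape string.
    s = data.decode("utf-8", errors="surrogateescape")
    return "".join(
        f"{MARK}{ord(c) - 0xDC00:02X}" if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in s
    )


def encode_with_escapes(text: str) -> bytes:
    """Reverse of decode_with_escapes: escape strings become the
    original raw bytes, everything else is encoded as UTF-8.
    Assumes any MARK in the input begins a well-formed escape."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == MARK and i + 2 < len(text):
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.extend(text[i].encode("utf-8"))
            i += 1
    return bytes(out)
```

The round trip byte stream -> string -> byte stream is exact for any input, which is requirement 1; and the scheme needs no codepoint assignments from the UTC at all.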



> Furthermore, today, y should not contain any of the 128 codepoints (assuming UTC takes unassigned codepoints and assigns them today).

This is also true of suitably chosen escape sequences. Except that the UTC does not need to assign them - you can choose them yourself, with any desired level of probability that they won't turn up by accident.



> And considerably less than inability to access files or even files being displayed with missing characters (or no characters at all).

There is also one other thing which you seem not to have considered. It is possible (and /much/ more likely than that a suitably chosen escape sequence might turn up by accident) that, in some non-Unicode encoding ... let's say the fictitious encoding Krakozhian ... the byte sequence emitted by UTF-8(c) might be extremely common (where c is one of your 128 reserved codepoints). In other words, you have to forbid the byte-sequences UTF-8(c), for all 128 c's, not just in Unicode (which, granted, you could do by reserving the characters, c, assuming you could wave a magic wand at the UTC), but in ALL OTHER ENCODINGS also. It strikes me that you have no way to guarantee that.
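The point is easy to demonstrate. Take any candidate reserved codepoint (your proposal never fixed the range, so U+EE00 below is an arbitrary stand-in from the private use area): its three-byte UTF-8 encoding is a perfectly legal byte sequence in any number of legacy encodings, so a legacy-encoded file can contain it entirely by accident.

```python
# Hypothetical reserved code point c - an arbitrary stand-in.
c = "\uEE00"
seq = c.encode("utf-8")
# seq is b'\xee\xb8\x80'. Decoding those same three bytes in a
# legacy encoding such as code page 866 (DOS Cyrillic) succeeds
# without error - they are just three ordinary characters there.
legacy_text = seq.decode("cp866")
```

So reserving the codepoint c in Unicode does nothing to stop the byte sequence UTF-8(c) turning up in non-Unicode data.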


Further, if you argue that this circumstance is unlikely enough not to bother about, then my previous arguments involving probability hold.

I hope I don't come across as arguing for the sake of arguing. I'm actually trying to help here. But you WILL NOT get your 128 codepoints, so it seems reasonable to look for other ways of solving the original problem which those codepoints were designed to solve.



One last question - why /can't/ locale conversion be automated? I don't really get this one, but it's the root of this whole topic. Surely, if we make the following assumptions:
(1) No user has a locale of UTF-8, and
(2) Some users will have created UTF-8 filenames and UTF-8 text files, and
(3) Some of those text files may have been concatenated, leading to mixed-encoding text files
then we can surely automate everything. (Assumption (1) can be met simply by asking any users who have changed their locale to UTF-8 to change it back again, temporarily.) Given these assumptions, all you have to do is:


    for (all users)
    {
        for (all filenames below ~/)
        {
            if (filename not valid UTF-8)
            {
                rename it by re-encoding it (assuming it to be currently
                encoded in the user's locale) to UTF-8
            }
        }
        for (all files below ~/)
        {
            if (the file can be positively identified as a text file)
            {
                re-encode all non-UTF-8 substrings (assuming them to be
                in the user's locale) to UTF-8
            }
        }
        change the user's locale to UTF-8
    }
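The filename half of that loop can be sketched in a few lines of Python. This is only an illustration of the idea, under the stated assumptions: the user's legacy locale encoding is known (LEGACY below is a hypothetical choice), and no one is writing UTF-8 names concurrently.

```python
import os

LEGACY = "latin-1"  # hypothetical stand-in for the user's legacy locale


def convert_tree(root: bytes) -> None:
    """Rename every non-UTF-8 filename below root, re-encoding it
    from the legacy locale to UTF-8."""
    # Walk bottom-up so entries are renamed before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            try:
                name.decode("utf-8")  # already valid UTF-8: leave it
            except UnicodeDecodeError:
                fixed = name.decode(LEGACY).encode("utf-8")
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, fixed))
```

The text-file half is the same shape: identify text files, then re-encode any non-UTF-8 substrings in place.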


Kernel files and other files under / but not under /user should all have ASCII filenames and contain ASCII text, so they won't be a problem anyway. (And even if that's not true, the superuser can do the same thing, taking care to avoid traversing /user.) References to filenames in scripts will have been modified along with the filenames, because scripts are text files. All that would fall through would be references to non-ASCII filenames in binary files, and you can mitigate even that, at least partially - for instance by dumping all databases into .sql files before conversion and reloading them afterwards, recompiling as much as possible from source after the conversion, etc.

A small amount of stuff would still fall through, but that set would be so small that by now it would be pretty reasonable just to say "hell - let it break". And when it breaks, fix it. I mean - if you actually /can/ automate things, then the whole of the rest of this line of discussion becomes unnecessary.

Just my thoughts.
Jill

PS. I'm on holiday from tomorrow, so if I fail to respond to any comments, it'll be because I'm not here. :-)








